TITLE OF THE INVENTION

PROCESSOR HAVING PRIORITY CHANGING FUNCTION ACCORDING TO 
THREADS 



BACKGROUND OF THE INVENTION
Field of the invention

The present invention relates to a data processing device,
such as a microprocessor or the like, and more particularly to
an effective means for thread management in a multi-thread
processor. The multi-thread processor is a processor capable of
executing a plurality of threads either on a time multiplex basis
or simultaneously without requiring the intervention of software,
such as an operating system or the like. The threads constitute
a flow of instructions having at least an inherent program counter
and permit sharing of a register file among them.

Prior art

Many different methods are available for higher speed
execution of a serial execution flow by upgrading effective
parallelism to a higher level than the serial execution: (1)
use of an SIMD (Single Instruction Multiple Data) instruction
or a VLIW (Very Long Instruction Word) instruction for
simultaneous execution of a single instruction into which a
plurality of mutually independent processes are put together,
(2) a superscalar method for simultaneous execution of a
plurality of mutually independent instructions, (3) an
out-of-order execution method of preventing the degradation of
effective parallelism and reducing stalls due to dependency among
data and resource conflict by executing the flow on an instruction
by instruction basis in a different order from that of the serial
execution flow, (4) software pipelining to execute a program
in which the natural order of the serial execution flow is
rearranged in advance to achieve the highest possible level of
effective parallelism, and (5) a method of dividing the serial
execution flow into a plurality of instruction sequences each
consisting of a plurality of instructions and having this
plurality of instruction sequences executed by a multi-processor
or a multi-thread processor. (1) and (2) are basic methods for
parallel processing; (3) and (4), methods for increasing the
amount of local parallelism extracted; and (5), a method for
extracting a general parallelism.

Intel's Merced described in MICROPROCESSOR REPORT, vol.
13, no. 13, Oct. 6, 1999, pp. 1 and 6-10, is mounted with a VLIW
system referred to in (1) above, and is further mounted with
a total of 256 64-bit registers, comprising 128 each for integers
and floating points for use in the software pipelining system
mentioned in (4). The large number of registers permits
parallelism extraction in the order of tens of instructions.

Compaq's Alpha 21464 described in MICROPROCESSOR REPORT,
vol. 13, no. 16, Dec. 6, 1999, pp. 1 and 6-11, is mounted with
a superscalar system referred to in (2) above, an out-of-order
system stated in (3) and a multi-thread system mentioned in (5). It
extracts parallelisms in the order of tens of instructions with
a large capacity instruction buffer and reorder buffer, further
extracts a more general parallelism by a multi-thread method
and performs parallel execution by a superscalar method. It
is therefore considered capable of extracting an overall
parallelism. However, as it does not analyze the relationship
of dependency among a plurality of threads, no simultaneous
execution of a plurality of threads dependent on one another
can be accomplished.

NEC's Merlot described in MICROPROCESSOR REPORT, vol. 14,
no. 3, March 2000, pp. 14-15 is an example of the multi-processor
referred to in (5). Merlot is a tightly coupled on-chip
four-parallel processor, executing a plurality of threads
simultaneously. It can also simultaneously execute a plurality
of threads dependent on one another. In order to facilitate
dependency analysis, there is imposed a constraint that a new
thread is generated only by the latest existing thread and the
new thread comes last in the order of serial execution.

A CPU (Central Processing Unit) in the "speculative
parallel instruction threads" in JP-A-8-249183 is an example
of the multi-thread processor referred to in (5). It is a
multi-thread processor for simultaneously executing a main
thread and a future thread. The main thread is a thread for
serial execution, and the future thread, a thread for
speculatively executing a program to be executed in the future
in serial execution. Data on a register or memory to be used
by the future thread are data at the time of starting the future
thread, and may be renewed by the starting time of the future
thread in serial execution. If they are renewed, because the
data used by the future thread will not be right, the result
of the future thread will be discarded, or if not, it will
be retained. Whether or not renewal has taken place is judged
by checking the program flow until the future thread starting
time in possible serial execution by the directions of condition
branching and according to whether or not it is a flow to execute
a renewal instruction. For this reason, it has the
characteristic of requiring no analysis of dependency among the
plurality of threads.

SUMMARY OF THE INVENTION 

For instance, a program shown in Fig. 1 is a program for
adding eight data. A processor for executing this program is
supposed to have repeat control instructions like the ones shown
in Fig. 2. If a repeat structure is configured of these
instructions before the execution of a repeat, repeat control
instructions such as a repeat counter updating instruction, a
repeat counter check instruction and a condition branching
instruction need not be executed during the repeat. Such repeat
control instructions are usual for digital signal processors
(DSPs) and can be readily applied to general purpose processors
as well.

A case is considered in which this program is executed
by a two-issued superscalar processor of 4 in load latency in
a pipeline configuration shown in Fig. 3. In the drawing,
reference sign I denotes an instruction fetch stage; D0 and D1,
instruction decode stages; E, an execution stage for addition,
store and the like; and L0 through L3, load stages. The pipeline
operation takes place as shown in Fig. 4. Referring to Fig.
4, instruction #7 is an instruction to load data from the address
of a register r0 to a register r2 and update the register r0
to the next address. Decoding takes place at the instruction
decode stage D0, loading is executed in a four-phase cycle of
load stages L0 through L3, and loaded data become usable at the
end of the L3 stage. At the same time, address updating is executed
at the L0 stage, and the updated address becomes usable at the
end of the L0 stage. On the other hand, instruction #8 is an
instruction to execute addition between the register r2 and the
register r3 and store the result into the register r3. Decoding
takes place at the instruction decode stage D1, addition is
performed at the execution stage E, and the result becomes usable
at the end of the E stage. Instruction #8 executes the E stage
at the next phase of the cycle to the L3 stage of instruction
#7 to use the result of loading by instruction #7. Since load
latency cannot be concealed, addition of N data takes 4N + 2
cycles. With the load latency being denoted by L, this means
LN + 2 cycles. If an access to an external memory is supposed
with a load latency of 30, for instance, addition of N data will
take 30N + 2 cycles.
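
The linear cycle count quoted above can be sketched as a small model. This is an illustration only: the function name `in_order_cycles` is hypothetical, and the constant 2 is simply taken from the 4N + 2 and LN + 2 figures stated in the text.

```python
def in_order_cycles(n_data: int, load_latency: int) -> int:
    """Cycles to add n_data values when every load's latency is exposed.

    Mirrors the LN + 2 expression quoted in the text for the
    two-issued in-order superscalar pipeline of Fig. 3.
    """
    if n_data < 0 or load_latency < 1:
        raise ValueError("n_data must be >= 0 and load_latency >= 1")
    return load_latency * n_data + 2


if __name__ == "__main__":
    # Eight data, as in the program of Fig. 1, at the two latencies discussed.
    for latency in (4, 30):
        print(f"L={latency}: {in_order_cycles(8, latency)} cycles")
```

Because the latency multiplies N, the count grows in proportion to the latency itself, which is the weakness the remaining schemes try to remove.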

Then, if an out-of-order executing function, such as that of
Alpha 21464 mentioned above, is added to the processor, at a load
latency of 4, the operation will be as shown in Fig. 5 and completed
in N + 5 cycles; at a load latency of 30, in N + 31 cycles; or
at a load latency of L, in N + L + 1 cycles. However, to meet
a load latency of 30, rearrangement spanning 60 instructions is
needed.
If N is set to 30 or above in the program of Fig. 1, the 30 load
instructions will be executed while holding 30 ADD instructions
out of the 60 instructions in an instruction buffer, and the
result will be written back in the original execution order after
the execution of the ADD instructions. For this reason, a large
capacity instruction buffer and reorder buffer, such as those
in Alpha 21464, are required, inviting a drop in the
cost-effectiveness of the processor.

If the program of Fig. 1 is increased in speed by a software
pipelining method, such as that of Merced referred to above, at a
load latency of 4, the operation will be as shown in Fig. 6. The
pipeline will be as shown in Fig. 7, and the program will be
completed in N + 5 cycles as in the case of the out-of-order
execution described above. In this case, three more registers
are used than in the program of Fig. 1, and to meet a load latency
of 30, the program should be altered into one using 29 extra
registers. The number of execution cycles will be N + 31. Thus
a software pipelining system requires a large number of registers
and optimization matching the latency length. In general terms,
the number of execution cycles will be MAX(1, L - X + 1) N +
MAX(L, X) + 1 cycles, wherein X is the load latency supposed
by the program and L, the actual load latency length. The function
expressed in the MAX(expression 1, expression 2) form is the
maximum selecting function, according to which the greater of
expression 1 and expression 2 is selected. If too short a latency
length is supposed, the first term will increase, but if too
long a latency is supposed, the second term will increase and,
moreover, invite a waste of registers. As the length of external
memory access latency varies even with a change in the operating
frequency alone, the software is poor in versatility. The
processor for usual 32-bit instructions has only 32 registers,
which means an insufficient number of registers.
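
The general expression for software pipelining can be evaluated directly to see how a mismatch between the supposed latency X and the actual latency L affects each term. The sketch below is illustrative; the function and parameter names are hypothetical, and only the formula itself comes from the text.

```python
def software_pipelined_cycles(n_data: int, actual_latency: int,
                              assumed_latency: int) -> int:
    """Evaluate MAX(1, L - X + 1) * N + MAX(L, X) + 1.

    actual_latency is L and assumed_latency is X in the text's notation.
    """
    # First term: cycles per datum; exceeds 1 when X underestimates L.
    per_datum = max(1, actual_latency - assumed_latency + 1)
    # Second term: fixed overhead; grows when X overestimates L.
    fixed = max(actual_latency, assumed_latency) + 1
    return per_datum * n_data + fixed


if __name__ == "__main__":
    n = 8
    for actual, assumed in ((4, 4), (30, 30), (30, 4), (4, 30)):
        cycles = software_pipelined_cycles(n, actual, assumed)
        print(f"L={actual:2d} X={assumed:2d}: {cycles} cycles")
```

With X equal to L the expression reduces to N + L + 1, matching the N + 5 and N + 31 figures given for latencies of 4 and 30.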

Thus, although the above-described methods of Alpha 21464
and Merced can raise the processing speed by parallelism
extraction in the order of tens of instructions, they may be
either poor in cost-effectiveness or incompatible with usual
32-bit instructions, and accordingly can only be used with an
expensive processor.

On the other hand, if the program of Fig. 1 is altered
for Merlot referred to above, the altered program will be as
shown in Fig. 8. The pipeline will be as shown in Fig. 9, the
issue of a future thread will become a bottleneck, and the addition
of N data will take 2N + 7 cycles. Focusing on any one processor,
it takes charge of one thread in every four threads, and
requires seven cycles to process one thread. This means L + 3
cycles at a load latency of L. On the other hand, since new
thread issues take place at a pitch of two cycles, a new thread
can be issued to the same processor in every 2 x 4 = 8 cycles.
Since threads to be executed by the same processor are serially
executed, the execution time is determined by the greater of
the issue pitch of 8 and the processing time of L + 3,
and accordingly the addition of N data takes MAX(L + 3, 8)
N/4 + 7 cycles. At a load latency of 30, it will take 33N/4
+ 7 cycles. The performance is poor considering that four
two-issued superscalar processors are mounted.
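
The Merlot estimate combines a per-processor reissue interval with the per-thread processing time. A minimal sketch of that arithmetic follows; the names are hypothetical, the constants 3, 7, the issue pitch of 2 and the processor count of 4 are those quoted in the text, and an exact fraction is returned because N/4 need not be an integer.

```python
from fractions import Fraction


def merlot_cycles(n_data: int, load_latency: int,
                  processors: int = 4, issue_pitch: int = 2) -> Fraction:
    """Evaluate MAX(L + 3, 8) * N / 4 + 7 for the four-processor case."""
    per_thread = load_latency + 3          # time one processor spends on a thread
    reissue = issue_pitch * processors     # 2 x 4 = 8 cycles between its threads
    return Fraction(max(per_thread, reissue) * n_data, processors) + 7


if __name__ == "__main__":
    for latency in (4, 30):
        print(f"L={latency}: {merlot_cycles(8, latency)} cycles for N=8")
```

At a latency of 4 the issue pitch dominates and the result equals 2N + 7; at a latency of 30 the processing time dominates and it equals 33N/4 + 7.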

Finally, altering the program of Fig. 1 to match the
multi-thread processor of JP-A-8-249183 cited above will result
in what is shown in Fig. 10. Since one instruction each is needed
for issuing and completing a future thread, altogether four
instructions are needed per datum including the two instructions
for the actual process. Furthermore, the main thread should
arrive without fail at the code executed as a future thread after
the future thread issue, because it is determined at the time
of arrival whether to adopt or discard the result of execution
of the future thread. It is imperative to avoid such a situation
that the issue of a future thread for the next repeat processing
results in the skip of a repeat and the main thread does not
perform the next repeat processing. Therefore, issuing at the
beginning of a repeat the future thread at the end of the repeat
is the earliest issue of a future thread. As a result, the issue
of a future thread becomes a bottleneck in the total execution,
and in the two-issued superscalar processor system the addition
of N data takes 3N + 5 cycles as shown in Fig. 11. In this case,
ADD of #10 in Fig. 11 and FORK of #9 three instructions after
that are executed simultaneously. Then at a load latency of
30, the execution of these #10 and #9 will take place 26 cycles
later than is shown in Fig. 11. As a result, the number of cycles
is determined by the load latency to be 29N + 5 cycles. In general
terms, it is MAX(3N + 5, (L - 1) N + L + 1) cycles. While the
hardware volume is smaller than in the aforementioned Alpha 21464,
Merced and Merlot systems, the performance is poorer.

The foregoing is summed up in Fig. 12, wherein #1 represents
generalization into N in the number of data and L in the load
latency level; #2, a case in which the load latency is relatively
short, i.e. 4; #3, a case in which the load latency is relatively
long, i.e. 30; and #4 through #7, cases in which the number of
data and the load latency length are given in specific numerals.
It is seen that, especially where the load latency is long,
parallelism extraction is difficult with any existing
multi-thread processor.



The problem to be solved by the present invention is to
make possible parallelism extraction in the order of tens of
instructions comparable to Alpha 21464 and Merced and performance
enhancement with only a modest addition of hardware elements
instead of a large-scale hardware addition as in the case of
Alpha 21464 or a fundamental architecture alteration as in Merced.
An especially important object of the invention is to make
possible parallelism extraction in the order of tens of
instructions by improving a multi-thread processor to enable
a single processor to execute a plurality of threads.

A conventional multi-thread processor simplifies new
thread issues and dependency analysis by assigning an order of
serial execution to a plurality of threads. However, by this
method, even if the program is as simple as what is shown in
Fig. 1, parallelism extraction is difficult. The invention
makes possible parallelism extraction in the order of tens of
instructions by effectively eliminating these constraints.

While the conventional multi-thread processor assigns a
fixed order of serial execution, the invention makes it possible
to alter the order of serial execution while a thread is being
executed. The invention thereby enables threads to be divided
in a different manner from the conventional method. Fig. 13
schematically illustrates the difference in thread division.
The number assigned to each instruction in Fig. 13 denotes its
position in the order of execution. The smaller its number,
the earlier the instruction's position in the order, which
therefore is #00, #01, #10, #11, ..., #71. According to the
prior art, serial execution is simply divided on a time multiplex
basis and threads are allocated on that basis. For this reason,
as many threads as there are processes desired to be executed
with priority need to be generated. Fig. 13 shows an example in
which division into eight threads takes place, and new threads
are issued at a new thread issue instruction FORK. Though not
shown, a thread end instruction is also required. If there is a
constraint on the number of threads that can be generated, this
constraint limits the number of processes to be given priority.
According to the invention, threads are allocated to prior
processes and others, and these two kinds of processes are
executed while subjecting the order of serial execution to a time
multiplex alteration. Many prior processes can be done with two
threads. Each SYNC in Fig. 13 is a point of alteration in the
order of serial execution.

For instance, as there is a serial execution order altering
point SYNC between instructions #00 and #10 of TH0 and between
instructions #01 and #11 of TH1, instructions #00 and #01, which
are before a serial execution order altering point SYNC, are
in earlier positions in the order of serial execution than the
#10 and following instructions of TH0 and the #11 and following
instructions of TH1. Other instructions are similarly given
their due positions in the order of serial execution. A serial
execution order altering point SYNC can be designated by an
instruction. When it is desired to define a repeat structure
by a repeat control instruction shown in Fig. 2, no special
instruction will be needed if the point of time at which a return
is made from a repeat end PC to a repeat start PC is used as the
serial execution order altering point SYNC.
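
The way SYNC points place the instructions of two threads into one serial order can be sketched as follows. This is a simplified model, not the circuit of the embodiment: the function name `serial_order`, the representation of a thread as a list of labels with "SYNC" markers, and the tie-break that TH0 precedes TH1 between the same pair of SYNC points (read off the #00, #01, #10, #11 numbering) are all assumptions made for illustration.

```python
from typing import List, Sequence


def serial_order(threads: Sequence[Sequence[str]]) -> List[str]:
    """Merge per-thread instruction streams into one serial order.

    Each stream is a list of instruction labels interleaved with the
    marker "SYNC". Instructions are ordered first by the number of
    SYNC points their thread has passed, then by thread index, then
    by their position inside the thread.
    """
    keyed = []
    for thread_id, stream in enumerate(threads):
        syncs_passed = 0
        for position, token in enumerate(stream):
            if token == "SYNC":
                syncs_passed += 1
                continue
            keyed.append(((syncs_passed, thread_id, position), token))
    keyed.sort(key=lambda pair: pair[0])
    return [token for _, token in keyed]


if __name__ == "__main__":
    th0 = ["#00", "SYNC", "#10", "SYNC", "#20"]
    th1 = ["#01", "SYNC", "#11", "SYNC", "#21"]
    print(serial_order([th0, th1]))
```

In this model, passing a SYNC only increments a per-thread counter, which is consistent with the statement that altering the serial order involves only a change of internal state.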

Fig. 14 illustrates a state of thread execution at a load
latency of 8 according to the prior art. For the convenience
of comparison with the present invention, it is supposed that
a FORK instruction can be issued in every cycle. To achieve
the highest possible performance, eight threads have to be
present at the same time. If the latency is 30, 30 threads will
be required. Fig. 15 illustrates a state of thread execution
at a load latency of 8 according to the invention. The highest
performance can be achieved with only two threads. Even if the
latency extends to 30, two threads will be sufficient. Further,
as an alteration in the order of serial execution involves only
a change in the internal state to be assigned to the instruction,
it is easier than a new thread issue instruction FORK, and can
be executed in every cycle with simple hardware.

There are three different dependency relationships: flow
dependency, reverse dependency and output dependency. With
respect to accessing the same register or memory address, flow
dependency is a relationship in which "read is done after the
end of every prior write"; reverse dependency, one in which "write
is done after the end of every prior read"; and output dependency,
one in which "write is done after the end of every prior write".
If these rules are observed, even if the executing order of
instructions is changed, the same result can be obtained as in the
case of an unchanged order.
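
The three relationships can be expressed as a lookup on the kinds of two accesses to the same location, taken in serial order. The sketch below is illustrative and the function name `classify_dependency` is hypothetical.

```python
from typing import Optional

_ACCESS_KINDS = ("read", "write")

# (earlier access, later access) -> dependency that must be respected
_DEPENDENCY = {
    ("write", "read"): "flow",      # read after every prior write
    ("read", "write"): "reverse",   # write after every prior read
    ("write", "write"): "output",   # write after every prior write
}


def classify_dependency(earlier: str, later: str) -> Optional[str]:
    """Classify two accesses to the same register or memory address.

    Both arguments are "read" or "write", given in serial order.
    Two reads impose no ordering, so None is returned for that pair.
    """
    if earlier not in _ACCESS_KINDS or later not in _ACCESS_KINDS:
        raise ValueError("access kind must be 'read' or 'write'")
    return _DEPENDENCY.get((earlier, later))


if __name__ == "__main__":
    for pair in (("write", "read"), ("read", "write"),
                 ("write", "write"), ("read", "read")):
        print(pair, "->", classify_dependency(*pair))
```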

Of these relationships of dependency, reverse dependency 
and output dependency occur when the storage spaces for different 
data are secured on the same register or memory address on a 
time multiplex basis. Therefore, if temporary data storage 
spaces are secured for separate storage, thread execution whose 
order of serial execution proceeds slowly can be started even 
if there are reverse dependency and output dependency. Both 
the present invention and the prior art use this method for
the multi-thread processor. 

On the other hand, the rules of flow dependency should
be observed. In the conventional multi-thread processor, if
the presence or absence of flow dependency is uncertain at the
time of executing an instruction, the result of execution is
left in the temporary data storage space and, if the absence of
flow dependency is perceived, it will be stored into the regular
storage space or, if the presence of flow dependency is perceived,
the processing will be cancelled and retried to obtain a correct
result. However, though this system permits normal operation,
it guarantees no high speed operation.

The present invention ensures high speed operation by
eliminating the possibility of cancellation/retrial. The
reason why a multi-thread processor may fail in flow dependency
analysis is the possibility that, before a data defining
instruction is decoded, another instruction using the pertinent
data may be decoded and executed. The invention imposes
a constraint that the defining instruction is decoded earlier
without fail. Incidentally, in an out-of-order execution system,
this problem does not arise because decoding is in order though
execution is out of order. Instead, it is necessary to decode
more instructions than the instructions to be executed and to
select and supply to the executing part executable instructions.

In the thread division system according to the invention
shown in Fig. 13, one of every two threads defines data and the
other uses the data. They are therefore defined to be a data
defining thread and a data using thread, respectively, and the data
defining thread is prohibited from using the data of the data
using thread. Thus the data flow is made a one-way stream from
the data defining thread to the data using thread. It is defined
that, though the data defining thread may pass the data using
thread, the data using thread may not pass the data defining
thread. As it is unnecessary to analyze the flow dependency
of the data defining thread on the data using thread, there will
occur no wrong operation even if the data defining thread passes
the data using thread, while, as the data using thread will
never pass the data defining thread, no error in flow dependency
analysis can occur.
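
One possible reading of the passing rule is sketched below. It is a deliberately simplified model and not the mechanism of the embodiment: progress is measured as the number of SYNC points each thread has passed, which is an assumption, and the class and method names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ThreadPair:
    """Progress of a data defining thread and a data using thread.

    Progress is counted in SYNC points passed (an assumed measure).
    """
    defining_syncs: int = 0
    using_syncs: int = 0

    def advance_defining(self) -> bool:
        # The data defining thread may always run ahead.
        self.defining_syncs += 1
        return True

    def advance_using(self) -> bool:
        # The data using thread may catch up with, but never pass,
        # the data defining thread.
        if self.using_syncs < self.defining_syncs:
            self.using_syncs += 1
            return True
        return False


if __name__ == "__main__":
    pair = ThreadPair()
    print(pair.advance_using())     # blocked: it would pass the definer
    pair.advance_defining()
    print(pair.advance_using())     # allowed: the definer is ahead
    print(pair)
```

Under this model the invariant `using_syncs <= defining_syncs` always holds, so a use can never be ordered ahead of its definition.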

The program of Fig. 1 can be modified for use in the present
invention into what is shown in Fig. 16. The repeat structure
of instruction #9 is defined by instructions #1, #3 and #7, and
that of instruction #15, by instructions #11 through #13. By
causing a thread generating instruction THRDG/R of the repeat
type to start a second thread, the repeat structures of two threads
can be configured with the point of time where a return takes
place from repeat end PC to repeat start PC as the serial execution
order altering point SYNC. The thread having issued the thread
generating instruction THRDG/R is the data defining thread, and
the thread generated by the thread generating instruction THRDG/R
is the data using thread.

It is supposed here that a processor to which the invention
is applied has a pipeline configuration of 4 in load latency
as shown in Fig. 17. Although it is customary not to expressly
refer to instruction address stages A0 and A1 as elements of
a pipeline and accordingly reference to them was dispensed with
in describing the prior art, they will be expressly referred to
in describing the operation of the present invention. In this
case, the pipeline operates as illustrated in Fig. 18, and the
number of execution cycles is N + 5. The number of cycles is
N + 31 at a latency of 30, and N + L + 1 at a latency of L.
Thus, this performance is comparable
to that in large-scale out-of-order execution or software
pipelining. The pipeline operation shown in Fig. 18 will be
described in detail afterward with reference to a specific
embodiment.
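
The cycle count claimed for the invention, together with the thread counts stated for Figs. 14 and 15, can be put side by side in a short sketch. The names are hypothetical. The rule that the prior art needs as many threads as the load latency is extrapolated from the two cases given in the text (8 threads at a latency of 8, 30 at a latency of 30) and rests on the text's supposition that a FORK can be issued in every cycle.

```python
def two_thread_cycles(n_data: int, load_latency: int) -> int:
    """N + L + 1 cycles, as stated for the two-thread scheme."""
    return n_data + load_latency + 1


def threads_required(load_latency: int, scheme: str) -> int:
    """Threads needed to hide the load latency completely.

    "prior_art": one thread per cycle of latency (extrapolated from
    the 8 and 30 thread cases in the text).
    "invention": two threads regardless of latency.
    """
    if scheme == "prior_art":
        return load_latency
    if scheme == "invention":
        return 2
    raise ValueError("scheme must be 'prior_art' or 'invention'")


if __name__ == "__main__":
    n = 8
    for latency in (4, 8, 30):
        print(f"L={latency:2d}: {two_thread_cycles(n, latency)} cycles, "
              f"{threads_required(latency, 'prior_art')} threads (prior art) "
              f"vs {threads_required(latency, 'invention')} (invention)")
```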



BRIEF DESCRIPTION OF THE DRAWINGS 

Fig. 1 illustrates a sample program. 

Fig. 2 illustrates a repeat control instruction. 

Fig. 3 illustrates an example of pipeline of a two-issued 
superscalar processor. 

Fig. 4 illustrates a two-issued superscalar pipeline 
operation of the program of Fig. 1 at a load latency of 4. 

Fig. 5 illustrates a two-issued superscalar out-of-order 
pipeline operation of the program of Fig. 1 at a load latency 
of 4. 

Fig. 6 illustrates a case in which the load latency of 
4 in the program of Fig. 1 is concealed by a software pipeline. 

Fig. 7 illustrates a two-issued superscalar pipeline 
operation of the program of Fig. 6 at a load latency of 4. 

Fig. 8 illustrates an example in which the program of Fig. 
1 is rewritten for use by a 4-parallel multi-processor of the 
Merlot system. 

Fig. 9 illustrates the pipeline operation of the program 
of Fig. 8 at a load latency of 4. 

Fig. 10 illustrates an example in which the program of
Fig. 1 is rewritten for use by a multi-thread processor according
to JP-A-8-249183.

Fig. 11 illustrates the pipeline operation of the program 
of Fig. 10 at a load latency of 4. 

Fig. 12 compares the numbers of cycles required by existing
systems.

Fig. 13 illustrates thread division systems according to 
the invention and the prior art. 

Fig. 14 illustrates thread execution according to the prior
art at a load latency of 8.

Fig. 15 illustrates thread execution according to the
invention at a load latency of 8.

Fig. 16 illustrates an example in which the load latency 
of 4 is concealed by multiple threads according to the invention. 

Fig. 17 illustrates an example of pipeline in a two-issued
multi-thread processor.

Fig. 18 illustrates the pipeline operation of the program 
of Fig. 16 at a load latency of 4. 

Fig. 19 illustrates a two-thread processor to which the
invention is applied.

Fig. 20 illustrates an example of instruction supply part.

Fig. 21 illustrates an example of instruction selection
part.

Fig. 22 illustrates combinations of selected instructions
by an instruction multiplexer.

Fig. 23 illustrates an example of register scoreboard
configuration.

Fig. 24 illustrates an example of load-based cell input
multiplexer.

Fig. 25 illustrates an example of top cell in the
scoreboard.

Fig. 26 illustrates an example of non-top cell in the 
scoreboard. 

Fig. 27 illustrates an example of control logic for the 
scoreboard. 

Fig. 28 illustrates an example of register module.

Fig. 29 illustrates an example of temporary buffer.

Fig. 30 illustrates an example of bypass multiplexer.

Fig. 31 illustrates an example of inter-thread two-way
data communication system.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT 

Fig. 19 illustrates an example of two-thread processor
to which the present invention is applied. It consists of
instruction supply parts IF0 and IF1, an instruction address
multiplexer MIA, instruction multiplexers MX0 and MX1,
instruction decoders DEC0 and DEC1, a register scoreboard RS,
a register module RM, instruction execution parts EX0 and EX1,
and a memory control part MC. The actions of these constituent
parts will be described below. Details of the actions of the
instruction supply parts IF0 and IF1, instruction multiplexers
MX0 and MX1, register scoreboard RS, and register module RM,
which are essential modules of the present invention, will be
described later.

In the description of this embodiment of the invention,
for the sake of simplicity, it is supposed that the instruction
supply part IF0 is fixed to a data defining thread and the
instruction supply part IF1 is fixed to a data using thread.
Undoing this fixation can be readily accomplished by persons
skilled in the art to which the invention is relevant. The
instruction multiplexer MX0, instruction decoder DEC0 and
instruction execution part EX0 are supposed to constitute a pipe
0, and MX1, DEC1 and EX1, a pipe 1.

The instruction supply part IF0 or IF1 supplies the
instruction address multiplexer MIA with an instruction address
IA0 or IA1, respectively. The instruction address multiplexer
MIA selects one of the instruction addresses IA0 and IA1 as an
instruction address IA, and supplies it to the memory control part
MC. The memory control part MC fetches an instruction from the
instruction address IA, and supplies it to the instruction supply
part IF0 or IF1 as an instruction I. Although the instruction
supply parts IF0 and IF1 cannot fetch instructions at the same
time, if the number of instructions fetched at a time is set
to 2 or more, a bottleneck attributable to the instruction fetch
would rarely occur. The instruction supply part IF0 supplies
the instruction multiplexers MX0 and MX1 with the top two
instructions out of the fetched instructions as I00 and I01,
respectively. Similarly, the instruction supply part IF1
supplies the instruction multiplexers MX0 and MX1 with the top
two instructions out of the fetched instructions as I10 and I11,
respectively.

The instruction supply part IF1 operates only when two
threads are running. When the number of threads increases from
1 to 2, thread generation GTH0 from the instruction supply part
IF0 to the instruction supply part IF1 and the register scoreboard
RS is asserted, and the instruction supply part IF1 is actuated.
When the number of threads returns to one, the instruction supply
part IF1 asserts an end of thread ETH1 and stops operating.

The instruction multiplexer MX0 selects an instruction
from the instructions I00 and I11, and supplies an instruction
code MI0 to the instruction decoder DEC0 and register information
MR0 to the register scoreboard RS. Similarly, the instruction
multiplexer MX1 selects an instruction from the instructions
I10 and I01, and supplies an instruction code MI1 to the
instruction decoder DEC1 and register information MR1
to the register scoreboard RS.

The instruction decoder DEC0 decodes the instruction code
MI0, and supplies control information C0 to the instruction
execution part EX0 and register information validity VR0 to the
register scoreboard RS. The register information validity VR0
consists of VA0, VB0, V0 and LV0 representing the validity of
reading out of RA0 and RB0 and writing into RA0 and RB0,
respectively. Similarly, the instruction decoder DEC1 decodes
the instruction code MI1, and supplies control information C1
to the instruction execution part EX1 and register information
validity VR1 to the register scoreboard RS. The register
information validity VR1 consists of VA1, VB1, V1 and LV1
representing the validity of reading out of RA1 and RB1 and writing
into RA1 and RB1, respectively.

The register scoreboard RS generates a register module
control signal CR and an instruction multiplexer control signal
CM from the register information MR0 and MR1, register
information validity VR0 and VR1, thread generation GTH0 and
end of thread ETH1, and supplies them to the register module
RM and the instruction multiplexers MX0 and MX1, respectively.

The register module RM, in accordance with the register
module control signal CR, generates input data DRA0 and DRB0
to the instruction execution part EX0 and input data DRA1 and
DRB1 to EX1, and supplies them to the instruction execution parts
EX0 and EX1, respectively. It also stores computation results
DE0 and DE1 from the instruction execution parts EX0 and EX1
and load data DL3 from the memory control part MC.

The instruction execution part EX0, in accordance with
the control information C0, processes the input data DRA0 and
DRB0, and supplies an execution result DE0 to the memory control
part MC and register module RM and an execution result DM0 to
the memory control part MC. Similarly, the instruction execution
part EX1, in accordance with the control information C1, processes
the input data DRA1 and DRB1, and supplies an execution result
DE1 to the memory control part MC and the register module RM
and an execution result DM1 to the memory control part MC.

The memory control part MC, if the instruction processed
by the instruction execution part EX0 or EX1 is a memory access
instruction, accesses the memory using the execution result DE0
or DE1. At this time, it supplies an address A and loads or
stores data D. Further, if the memory access is for loading,
it supplies the load data DL3 to the register module RM.

To relate the description to the pipeline of Fig. 17,
instruction address-related actions of the instruction supply
parts IF0 and IF1 correspond to instruction address stages A0 and
A1, instruction supply-related actions of the instruction supply
parts IF0 and IF1 and actions of the instruction multiplexers
MX0 and MX1 to instruction fetch stages I0 and I1, actions of
the instruction decoders DEC0 and DEC1 to instruction decode
stages D0 and D1, actions of the instruction execution parts
EX0 and EX1 to the instruction execution stages E0 and E1, and
actions of the memory control part MC to load stages L1, L2 and
L3. The register scoreboard RS holds and updates information
on the stages of instruction decoding, execution and loading.
The register module RM operates when read data are supplied at
the instruction decode stages D0 and D1 and when data are written
back at the instruction execution stages E0 and E1 and the load
stage L3.

Fig. 20 illustrates an example of instruction supply part
IFj (j = 0, 1) of the processor of Fig. 19. During regular
operation, a +4 incrementer generates the next program counter
PCj + 4 from the program counter PCj; multiplexers MXj and MRj
select and supply it as an instruction address IAj and also
store it into the program counter PCj. By repeating this
processing, the instruction address IAj is incremented by 4 at
a time, and fetching of instructions at consecutive addresses
is requested. The instruction IL fetched from the instruction
address IAj is stored into an instruction queue IQjn (where n
is the entry number). Whenever an instruction is to be stored,
PCj and the number of repeats RCj, to be explained later, are
stored into the program counter PCjn and a validity bit IVjn
is asserted.

A branching instruction decoder BDECj takes out and decodes
branching-related instructions (branching, THRDG, THRDE, LDRS,
LDRE, LDRC, etc.) from the instruction queue IQjn, and supplies
an offset OFSj and the thread generation signal GTH0 or the end
of thread ETH1. It then adds the program counter PCjn and the
offset OFSj with an adder ADj.

Where the instruction is a branching instruction or a thread
generating instruction THRDG, the instruction address
multiplexers MXj and MRj select the output of the adder ADj
as the branching destination address, supply it as the
instruction address IAj and also store it into the program
counter PCj. The instruction IL fetched from the
instruction address IAj is stored into the instruction queue IQjn
if it is a branching instruction or into the instruction queue IQ1n
of IF1 if it is the thread generating instruction THRDG. The
instruction supply part IF0, if the instruction is the thread
generating instruction THRDG, further asserts the thread
generation GTH0, and actuates the instruction supply part IF1.
The instruction supply part IF1, if the instruction is the end
of thread instruction ETHRD, asserts the end of thread ETH1 and
stops operating.

If the instruction is the LDRS instruction of Fig. 2, the
output of the adder ADj is stored into a repeat start address
RSj. If the instruction is the LDRE instruction of Fig. 2, the
output of the adder ADj is stored into a repeat end address REj.
If the instruction is the LDRC instruction of Fig. 2, the offset
OFSj is selected by a number-of-repeats multiplexer MCj as the
number of repeats and stored into the number of repeats RCj.
The number of repeats shall be not less than one; even if
0 is specified, the repeated section is executed once before the
repeat is skipped. At the same time, the repeat start address RSj
and the repeat end address REj are compared by a repeated
instruction number comparator CRj. If they are found identical,
this means that a single instruction is repeated, and therefore
that instruction continues to be held in the instruction queue
IQjn so that it need not be fetched again.

When the repeat mechanism is not used, the number of repeats
RCj is set to zero. At this time, the bits of the number of
repeats RCj other than the least significant bit are entered into
a number of times comparator CCj and compared with zero. As the
comparison finds them equal to zero, the output of an end of
repeat detecting comparator CEj is masked by an AND gate, and
the instruction address multiplexer MRj selects the output of
the instruction address multiplexer MXj irrespective of the
input PCj to the end of repeat detecting comparator CEj and the
value of REj, with no repeat processing carried out.

When addresses are stored into the repeat start address
RSj and the repeat end address REj and a value of 2 or above
is stored into the number of repeats RCj, the repeat mechanism
is actuated. The program counter PCj and the repeat end address
REj are compared by the end of repeat detecting comparator CEj
all the time, and an identity signal is supplied to the AND gate.
When the program counter PCj and the repeat end address REj become
identical, the identity signal takes on a value of 1. If the
number of repeats RCj is then not less than 2, the output of
the number of times comparator CCj is 0, so the output
of the AND gate becomes 1, and the instruction address multiplexer
MRj selects the repeat start address RSj, supplying it as the
instruction address IAj. As a result, the instruction fetch
returns to the repeat start address. At the same time as the
action stated above, the number of repeats RCj is decremented,
and the result is selected by the number-of-repeats multiplexer
MCj to become an input to the number of repeats RCj. The number
of repeats RCj is updated unless the program counter PCj and
the repeat end address REj are identical and the number of repeats
RCj is zero. In the instruction queue IQjn, the number of repeats
RCj matching each instruction in the queue is assigned as a thread
synchronization number IDjn. When the number of repeats RCj
becomes one, the output of the number of times comparator CCj
becomes one, with the result that repeat processing no longer
takes place and the number of repeats RCj is updated to zero
to end the operation. In the case of a single-instruction repeat,
the instruction continues to be held in the instruction queue IQjn,
and only the thread synchronization number IDjn is updated. At
the end of the repeat, the process returns to the usual
instruction queue IQjn operation.
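
The repeat control described above can be sketched in software as
follows. This is only an illustrative model under simplifying
assumptions: one address is issued per step, the function name
fetch_addresses and its parameters are hypothetical, and the
comparators CEj and CCj are reduced to two Boolean expressions.

```python
def fetch_addresses(start, rs, re, rc, steps):
    """Return the instruction addresses issued over `steps` fetches.

    start: initial program counter (hypothetical parameter)
    rs, re: repeat start and repeat end addresses (RSj, REj)
    rc: number of repeats (RCj); 0 means the mechanism is unused
    """
    pc = start
    trace = []
    for _ in range(steps):
        trace.append(pc)
        at_end = (pc == re)            # end-of-repeat comparator CEj
        upper_zero = (rc >> 1) == 0    # comparator CCj: RCj is 0 or 1
        if at_end and not upper_zero:
            pc = rs                    # MRj selects the repeat start address
            rc -= 1                    # RCj is decremented on each return
        else:
            if at_end and rc == 1:
                rc = 0                 # last pass: the repeat ends
            pc += 4                    # regular +4 increment
    return trace
```

With rc = 3 the repeated section is traversed three times, and with
rc = 0 or rc = 1 it is traversed once, which agrees with the rule
that the repeated section is executed at least once.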

Incidentally, it is also possible to use the less significant
bits of the number of repeats RCj as the thread synchronization
number IDjn. In this case, if the data defining thread is too
far ahead, the thread synchronization numbers ID0n and ID1m
(where m is the entry number) may become identical in spite of
the difference between the numbers of repeats RC0 and RC1. In
such a case, the data defining thread is deterred from instruction
fetching. Thus, if the thread synchronization numbers ID0n and
ID1m are identical and the numbers of repeats RC0 and RC1 are
different, IF0 performs no instruction fetching.
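
The aliasing condition can be illustrated with a small sketch. It
is a simplification: the comparison is made directly on the two
repeat counters rather than on individual queue entries, and the
name if0_may_fetch and the parameter id_bits (the number of less
significant bits kept as the synchronization number) are
hypothetical.

```python
def if0_may_fetch(rc0, rc1, id_bits):
    """Return False when IF0 must hold off fetching because the
    truncated synchronization numbers collide although the repeat
    counts differ."""
    mask = (1 << id_bits) - 1
    id0 = rc0 & mask                  # truncated number for thread 0
    id1 = rc1 & mask                  # truncated number for thread 1
    aliased = (id0 == id1) and (rc0 != rc1)
    return not aliased
```

For example, with two identification bits the repeat counts 7 and 3
both truncate to 3, so the data defining thread is held back until
the counts no longer alias.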

Fig. 21 illustrates an example of instruction multiplexer
Mj (j = 0, 1) of the processor of Fig. 19. The instruction Ix
(x = j0, k1, j) consists of an operation code OPx, register fields
RAx and RBx, a thread synchronization number IDx and an
instruction validity IVx. The instruction multiplexer Mj
selects out of two instructions Ij0 and Ik1 ({j, k} = {0, 1},
{1, 0}) the instruction Ij0 if the instruction Ij0 is executable
or, if not, the instruction Ik1 as the instruction Ij. Then
it supplies the selected thread as a thread number THj. Thus,
if the instruction Ij0 is selected, THj = j, or if the instruction
Ik1 is selected, THj = k. Of the constituent elements of the
instruction Ij, the operation code OPj and the instruction
validity IVj are supplied to the instruction decoders DECj as
the instruction code MIj, and the register fields RAj and RBj, the
thread synchronization number IDj and thread number THj are
supplied to the register scoreboard RS as the register
information MRj.

Executability is judged according to data dependency on
the instruction under prior execution. In a pipeline
configuration with a load latency of 4 as shown in Fig. 17,
execution may be made impossible by flow dependency on three prior
instructions. THj generating logic illustrated in Fig. 21
carries out determination of this flow dependency and
determination of the validity of instructions. This logic is
similar to the register scoreboard RS to be explained later.
It receives scoreboard information CM from the register
scoreboard RS and performs determination. First, it is checked
with an instruction code OPj0 whether or not the register fields
RAj0 and RBj0 are to be used for reading out of registers, and read
validities MVAj and MVBj are generated. Read RA and read RB
are functions for this purpose, and if the code allocation for
instructions is regular, high speed determination is possible
by merely checking part of the instruction code OPj0. Further,
in order to unify the formula, out of write-back possible Ry
(y = L, L0, L1), RL which essentially does not exist is defined
to be RL = 0. Flow dependency detection MFjy then is as shown
in Fig. 21. Flow dependency arises if valid read and write
register numbers are identical when writing back into the same
thread, same thread synchronization number or same register file
is possible. If no flow dependency arises and the instruction
is valid, selection validity MVj is asserted, and Ij and THj
are selected on the basis of that MVj. Further, the THj generating
logic ensures that the data using thread may not pass the data
defining thread. This is achieved by so arranging that THj be
equal to 0 when thread synchronization numbers IDj0 and IDk1
are identical. Thus, when the thread synchronization numbers
are identical, the data defining thread is selected.
Incidentally, since the determination of data dependency takes
time, where the fetch instruction from the memory control part
MC is not latched into the instruction queue IQjn but directly
supplied to the instruction multiplexer Mj, no determination
of data dependency is performed, and the instruction is supplied
in anticipation of executability. Usually, what is directly
supplied is the top instruction of a branching destination and
accordingly is likely to be executable.
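
The selection rule of the instruction multiplexer Mj can be
summarized by the following sketch. It is not the logic of Fig. 21
itself: executability is passed in as a precomputed flag, the class
Instr and the function select are hypothetical names, and giving the
identical-synchronization-number rule precedence over the
executability test is an assumption made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Instr:
    name: str          # label such as "I00" (illustrative only)
    sync_id: int       # thread synchronization number ID
    executable: bool   # result of the dependency check, assumed given


def select(j, own, other):
    """Pick the instruction Ij and thread number THj for multiplexer Mj.

    own is Ij0 (head of thread j), other is Ik1 (second of thread k).
    """
    k = 1 - j
    if own.sync_id == other.sync_id:
        th = 0         # identical numbers: the data defining thread wins
    elif own.executable:
        th = j         # own thread's head instruction is executable
    else:
        th = k         # otherwise fall back to the other thread
    chosen = own if th == j else other
    return chosen, th
```

For instance, for j = 1 with identical synchronization numbers the
sketch returns the instruction of thread 0, so the data using thread
cannot overtake the data defining thread.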

By the above-described selection method, instructions are
selected according to the executability of the instructions I00
and I10 as shown in Fig. 22. In the case of #1, the instructions
I00 and I10 are selected, and both are executable. In the case
of #2, as the instruction I10 is inexecutable, the instruction
I11 is also inexecutable. On the other hand, out of the selected
instructions I00 and I01, I00 is executable and the
executability of I01 is unknown. Thus, an instruction or
instructions which are known to be or may be executable are
selected, but no inexecutable instruction is selected. The same
is true of #3. In the case of #4, since both instructions I00
and I10 are inexecutable, all the four instructions are
inexecutable, and whichever instruction is selected is not
executed.

Fig. 23 illustrates an example of register scoreboard RS.
As in the conventional processor, write information into a
register file matching the pipeline stage is held and compared
with new read information to detect three kinds of dependency
regarding registers, including flow dependency, reverse
dependency and output dependency. Also, write information into
a register file, which is temporarily deterred by reverse
dependency or output dependency, is held and compared with new
read information to detect the three aforementioned kinds of
dependency. Further, whether or not writing is possible
according to reverse dependency or output dependency is
determined, and a write instruction is given. Details will be
described below.

Cell SBL0, which is at the top of scoreboard, holds
load data write information RL selected by a multiplexer ML out
of the register information MR0 or MR1 as control information
for the load stage L0, and generates and supplies bypass control
information BPL0y (y = RA0, RB0, RA1, RB1) and next stage control
information NL0 from the held data and the register information
MR0 and MR1. Similarly, cells SBE0 and SBE1 which are at the
top of scoreboard hold the register information MR0 and MR1 as
control information for the execution stages E0 and E1,
respectively, and generate and supply bypass control information
BPE0y and BPE1y and next stage control information NE0 and NE1
from the held data and the register information MR0 and MR1.
Also, cells SBL1, SBL2 and SBL3 which are not at the top of
scoreboard hold next stage control information NL0, NL1 and NL2
as control information for the load stages L1, L2 and L3, and
generate and supply bypass control information BPL1y, BPL2y and
BPL3y and next stage control information NL1, NL2 and NL3 from
the held data and the register information MR0 and MR1. Further,
cells SBTB0, SBTB1 and SBTB2 which are not at the top of scoreboard
hold temporary buffer control information NM0, NM1 and NM2
selected by the scoreboard control part CTL as temporary buffer
control information, and generate and supply bypass control
information BPTB0y, BPTB1y and BPTB2y and next cycle control
information NTB0, NTB1 and NTB2 from the held data and the register
information MR0 and MR1. Also, the scoreboard control part CTL
detects any stall according to flow dependency and
temporary buffer fullness and controls writing into the
register file RF and a temporary buffer TB. Further, it supplies
input signals for scoreboard cells SBL0, SBL1 and SBL2 to the
instruction multiplexers MX0 and MX1 as scoreboard information
CM = {RL, THL, IDL, VL, NL0, NL1}.

Details of the multiplexer ML, cells SBL0, SBE0 and SBE1
which are at the top of scoreboard, cells SBL1, SBL2, SBL3, SBTB0,
SBTB1 and SBTB2 which are not at the top of scoreboard, and the
scoreboard control part CTL will be described below with
reference to Fig. 24 through Fig. 27.

Fig. 24 illustrates an example of multiplexer ML. Write
information on load instructions is selected from the register
information MR0 or MR1. If both are load instructions,
information on the prior instruction is selected. If neither
is a load instruction, either can be selected. Therefore, if
the prior instruction is a load instruction, its register
information or, if it is not a load instruction, the other register
information is selected. As stated above, the register
information MRj (j = 0, 1) consists of register fields RAj and
RBj, a thread synchronization number IDj and a thread number
THj. As will be explained later, if the thread number TH0 is
0, the instruction I0 is the prior instruction, or if the thread
number TH0 is 1, the instruction I1 is. As the first term in
the equation of selecting condition for the register information
MR0 given in Fig. 24 is TH0 = 0 and the write signal LV0 being
asserted, the instruction I0 is the prior instruction and a load
instruction. On the other hand, as the second term is TH0 =
1 and the write signal LV1 being negated, the instruction I1
is the prior instruction and a non-load instruction. A load
pipe SBL indicating which has been selected is supplied to the
scoreboard control part CTL. At the time of stall, as the
instruction is not executed, the write validity VL is invalidated
with a stall signal STL0 or STL1.
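
The selecting condition for the register information MR0 can be
written out as a small sketch. The function name select_load_info
is hypothetical, and the two terms are taken from the prose
description above rather than from Fig. 24 itself.

```python
def select_load_info(th0, lv0, lv1, mr0, mr1):
    """Return the register information forwarded by multiplexer ML.

    th0: thread number TH0 (0 means I0 is the prior instruction)
    lv0, lv1: load write validities LV0 and LV1
    """
    pick_mr0 = (th0 == 0 and lv0) or (th0 == 1 and not lv1)
    return mr0 if pick_mr0 else mr1
```

In all four combinations the prior instruction's information is
chosen when it is a load, and the other instruction's information
is chosen otherwise.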

If the thread number TH0 is 0, the combination of
instructions selected by the instruction multiplexer MX0 is
either #1 or #2 in Fig. 22. If it is #1, the instruction I0
is the instruction I00 of the data defining thread supplied from
the instruction supply part IF0, and the instruction I1 is the
instruction I10 of the data using thread supplied from the
instruction supply part IF1. Therefore, if the instruction I00
is executed earlier than the instruction I10, it does not violate
the execution order rule for data defining threads and data using
threads according to the present invention. If it is #2, the
instructions I0 and I1 are the instructions I00 and I01, and I0
is prior in the order of serial execution. On the other hand,
if the thread number TH0 is 1, the combination of instructions
selected by the instruction multiplexer MX0 is either #3 or #4
in Fig. 22. If it is #3, the instructions I0 and I1 are the
instructions I11 and I10, and I1 is prior in the order of serial
execution. If it is #4, both the instructions I0 and I1 are
inexecutable. From the foregoing, if the thread number TH0 is
0, the instruction I0 is the prior instruction, or if the thread
number TH0 is 1, the instruction I1 is.

Fig. 25 illustrates an example of top cell SBx (x = L0,
E0, E1) in the scoreboard. Inputs Rs, THt, IDt and Vt & ~u ({s,
t, u} = {L, L, 1}, {A0, 0, STL0}, {A1, 1, STL1}) are held as
a write register number Wx, a write thread number THx, a write
thread synchronization number IDx and a write validity Vx, which
constitute x stage write information, and bypass control
information BPxy (y = RA0, RB0, RA1, RB1) and next stage write
control information Nx = {Wx, THx, IDx, BNx, Vx} are generated
and supplied from these inputs and the register information MR0
and MR1, register write signals V0 and L0, and V1 and L1. Masking
of the input Vt with u is to invalidate write information because
no instruction is executed at the time of stall.

The first equation of the logical part SBxL of Fig. 25
is the defining equation for the bypass control information BPxy.
The bypass control information BPxy is asserted when writing
at the x stage is valid, the write register number Wx and the
register read number y are identical, and writing and reading
have the same thread number or the same thread synchronization
number. If they have the same thread number, it means bypass
control within the thread, which is commonly accomplished in
conventional processors as well. On the other hand, if they
have the same thread synchronization number, it means bypass
control from a data defining thread to a data using thread. The
absence of bypass control in the reverse direction, i.e. from
a data using thread to a data defining thread, is due to the
configuration of the instruction multiplexer Mj which does not
permit the data using thread to pass the data defining thread.
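
The condition for asserting BPxy in a top cell, as stated in the
paragraph above, can be expressed as a single Boolean function. The
function and parameter names are hypothetical, and the sketch
follows the prose rather than the exact equation of Fig. 25.

```python
def bypass_top(vx, wx, thx, idx, ry, th_read, id_read):
    """Bypass control BPxy for a top scoreboard cell.

    vx: write validity Vx          wx: write register number Wx
    thx, idx: thread number and synchronization number of the write
    ry: register number being read
    th_read, id_read: thread number and synchronization number of
                      the reading instruction
    """
    same_register = (wx == ry)
    same_thread = (thx == th_read)   # bypass within one thread
    same_sync = (idx == id_read)     # defining thread to using thread
    return bool(vx and same_register and (same_thread or same_sync))
```

A write by thread 0 with synchronization number 2 is thus bypassed
to a read by thread 1 carrying the same synchronization number, but
not to one carrying a different number.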

Out of the elements of the next stage write control
information Nx, the held information of the write register number
Wx, write thread number THx, write thread synchronization number
IDx and write validity Vx is supplied as it is. Write back BNx
indicates that reverse dependency and output dependency have
been eliminated, making possible writing back into the register
file. In this embodiment, if the thread synchronization number
of the data using thread is identical with the thread
synchronization number of the write control information,
assertion is done and continued until writing back is achieved.
The second equation of the logical part SBxL of Fig. 25 is the
defining equation for the write back BNx.

Fig. 26 illustrates an example of cell SBx (x = L1, L2,
L3, TB0, TB1, TB2) which is not at the top of scoreboard. Input
signals Wt, THt, IDt, BNt and Vt (t = L0, L1, L2, M0, M1, M2)
are held as a write register number Wx, write thread number THx,
write thread synchronization number IDx, write back Bx and write
validity Vx, which constitute x stage write information, and
bypass control information BPxy (y = RA0, RB0, RA1, RB1) and
next stage write control information Nx = {Wx, THx, IDx, BNx,
Vx} are generated and supplied from these inputs and the register
information MR0 and MR1, register write signals V0 and L0, and
V1 and L1.

The first equation of the logical part SBxL of Fig. 26
is the defining equation for the bypass control information BPxy.
The bypass control information BPxy is asserted when writing
at the x stage is valid, the write register number Wx and the
register read number y are identical, and writing and reading
have the same thread number and the same thread synchronization
number or write back is being asserted. The difference from
what is shown in Fig. 25 consists in the addition of the condition
of write back Bx being asserted. According to this condition,
data not yet written back are supplied on a bypass basis in place
of the register value. The second equation of the logical part
SBxL of Fig. 26 is the defining equation for the write back BNx.
The difference from Fig. 25 consists in the addition of the
condition of write back Bx being asserted. According to this
condition, the write back Bx, once asserted, continues to be
asserted until it is written back.

Fig. 27 shows an example of scoreboard control logic CTL
in Fig. 23. Any stall due to flow dependency is detected in
the following manner. As the load latency is 4, data matching
the write control information NLz (z = 0, 1, 2) are not yet valid.
Therefore, if the bypass control BPzy (y = A0, A1, B0, B1) is
asserted, bypassing of invalid data is required, which cannot
be realized. Accordingly, if any such signal is asserted, it
is necessary to have the execution start of any instruction using
the bypass data wait until the data become valid. For this reason,
stall signals STL0 and STL1, into which the bypass controls BPzy
are collected, are supplied. On this occasion, the bypass control
BPzy is masked with read validities VA0, VB0, VA1 and VB1 out
of the register information validities VR0 and VR1. Further,
when the prior instruction is stalled, the posterior instruction
is also stalled to maintain the order of serial execution. As
stated in the description of the multiplexer ML, if the thread
number TH0 is 0, the instruction I0 is the prior instruction,
or if the thread number TH0 is 1, the instruction I1 is. In
addition, if both prior and posterior instructions are data load
instructions, the posterior instruction is stalled. That is, if
the write validity LV0 or LV1 to the write register RB0 or RB1
for data loading is asserted for the pipe not selected by the
multiplexer ML, i.e. the pipe not indicated by the load pipe
SBLm, stalling is carried out. From the foregoing, stall signals
STL0 and STL1 are defined by the first through fourth equations
of Fig. 27. An individual thread STH is negated during the period
from the thread generation GTH0 until the end of thread ETH1.
Therefore its generation formula takes on the form of the fifth
equation of Fig. 27.
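
The stall conditions listed above can be gathered into a rough
sketch. It does not reproduce the equations of Fig. 27: the data
layout (a list of three dictionaries keyed by read port), the
function name stall_signals and the way the three conditions are
combined are assumptions made for illustration.

```python
def stall_signals(bp_load, read_valid, th0, lv0, lv1):
    """Return (STL0, STL1) under the simplifying assumptions above.

    bp_load[z][y]: bypass request from load stage z (0..2) for read
                   port y in {"A0", "B0", "A1", "B1"}
    read_valid[y]: read validity for port y (VA0, VB0, VA1, VB1)
    th0: thread number TH0 (0 means pipe 0 holds the prior instruction)
    lv0, lv1: load write validities LV0 and LV1
    """
    def needs_wait(ports):
        # A valid read that would need data still in flight must wait.
        return any(bp_load[z].get(y, False) and read_valid.get(y, False)
                   for z in range(3) for y in ports)

    raw = [needs_wait(("A0", "B0")), needs_wait(("A1", "B1"))]
    prior = 0 if th0 == 0 else 1
    post = 1 - prior
    stl = [False, False]
    stl[prior] = raw[prior]
    # The posterior instruction also stalls behind a stalled prior
    # instruction, or when both instructions are loads.
    stl[post] = raw[post] or raw[prior] or bool(lv0 and lv1)
    return stl[0], stl[1]
```

With TH0 = 0, a pending load result needed on port A0 stalls both
pipes, whereas with TH0 = 1 the same request stalls only pipe 0,
because pipe 1 then holds the prior instruction.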

The write data are validated upon the end of the pipeline
stage E0, E1 or L3. The matching write information of the register
scoreboard RS is NE0, NE1 or NL3. The data held in the temporary
buffer are also valid. Valid data are written back into the
register file RF as soon as reverse dependency or output
dependency is eliminated. As a thread number THx (x = E0, E1,
L3, TB0, TB1, TB2) of 1 means a data using thread, neither
reverse dependency nor output dependency arises, and valid data
can be written at any time. On the other hand, if the thread
number THx is 0, the data can be written back when the reverse
dependency or output dependency is eliminated and write back
Bx is asserted. Further, while an individual thread STH is being
asserted, neither reverse dependency nor output dependency
arises. From the foregoing, a write indication Sx takes on the
form of the sixth equation of Fig. 27. Where valid data are
prevented by either reverse dependency or output dependency from
being written, a temporary buffer control Cx is asserted to write
into the temporary buffer TB. The temporary buffer control Cx
takes on the form of the seventh equation of Fig. 27. As the
temporary buffer TB has three entries, if four or more of the
six temporary buffer controls Cx are asserted, writing into the
temporary buffer TB is impossible. In this case, the stall signal
STLTB attributable to the temporary buffer is asserted to stop
the progress of the pipeline. If no more than three are asserted,
writing is possible. Since writing into the temporary buffer
TB is done only from a data defining thread, the data written
into it are in the order of serial execution. The positions
in this order are always TB2, TB1 and TB0 from the earliest onward,
and write data into the temporary buffer TB are selected so that
TB0 is selected where one entry in the temporary buffer TB is
to be used, or TB0 and TB1 are selected where two entries are
to be used. Generation of data selections M0, M1 and M2 according
to this principle would result in the table of Fig. 27.
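
The allocation principle for the temporary buffer can be sketched
as follows. The table of Fig. 27 is not reproduced; the function
name assign_temp_buffer, the dictionary representation of the
controls Cx and the use of None to signal STLTB are assumptions, and
the serial order TB2, TB1, TB0, L3, E0, E1 is the one stated for
this embodiment.

```python
SERIAL_ORDER = ["TB2", "TB1", "TB0", "L3", "E0", "E1"]  # earliest first


def assign_temp_buffer(c):
    """Map asserted temporary buffer controls Cx to selections Mz.

    c: dict from source name to bool (the six controls Cx).
    Returns None when more than three are asserted (stall STLTB),
    otherwise a dict such as {"M1": "L3", "M0": "E0"}.
    """
    pending = [x for x in SERIAL_ORDER if c.get(x, False)]
    if len(pending) > 3:
        return None                    # buffer would overflow: stall
    n = len(pending)
    # The earliest pending datum takes the highest-numbered entry in
    # use, so that entries always read TB2, TB1, TB0 from the earliest.
    return {"M%d" % (n - 1 - i): src for i, src in enumerate(pending)}
```

With one datum to hold, only TB0 is used; with two, TB1 receives
the earlier datum and TB0 the later one.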

Incidentally, positions in the order of serial execution
including write data from the pipeline stage E0, E1 or L3 are
TB2, TB1, TB0, L3, E0 and E1 from the earliest onward. Then
according to the data selections M0, M1 and M2, the next stage
write control information Nt (t = M0, M1, M2) is selected from
Nx. The final three equations of Fig. 27 are the selection
formulas.

Fig. 28 illustrates an example of register module
RM of the processor shown in Fig. 19. It consists of the register
file RF, a temporary buffer TB and a read data multiplexer My
(y = A0, A1, B0, B1). It has the register control signal CR
and output data DE0, DE1 and DL3 as its inputs and read data
DRy (y = A0, A1, B0, B1) as its output. The register control
signal CR consists of a register read number Ry, bypass control
BPxy (x = E0, E1, L3, TB0, TB1, TB2), register write number Wx,
register write control signal Sx, temporary buffer write data
selection Mz (z = 0, 1, 2) and thread number TH0.

The register file RF has 16 entries, 4 reads and 6 writes.
When the write control signal Sx is asserted, data Dx are written
into No. Wx of the register file RF. Also, No. Ry of the register
file RF is read as register read data RDy.

The temporary buffer TB, having a bypass control BPTBzy,
data selection Mz and output data DE0, DE1 and DL3 as its inputs,
supplies temporary buffer hold data DTBz and temporary buffer
read data TBy as its outputs. It also updates the hold data
DTBz in accordance with the write data selection signal Mz.
Details will be described with reference to Fig. 29. The
temporary buffer hold data DTBz are constantly supplied. The
selection logic for the write data DNTBz is expressed in the
first three equations of the temporary buffer multiplexer TBM.
The selection is done according to the selection signal Mz. The
selection logic for the read data TBy is expressed in the final
equation of the temporary buffer multiplexer TBM. The selection
is done according to the bypass control BPTBzy.

Incidentally, when a plurality of bypass controls BPzy
are asserted, the latest data are selected. Namely, the last
in the order of serial execution is selected.

The read data multiplexer My has the bypass control BPxy,
thread number TH0, register read data RDy, temporary buffer read
data TBy and output data DE0, DE1 and DL3 as its inputs and supplies
read data DRy (y = A0, A1, B0, B1) as its output. Details will
be described with reference to Fig. 30. Even when a plurality
of bypass controls BPxy are asserted, it selects the latest data.
Between the output data DE0 and DE1, DE1 is newer if the thread
number TH0 is 0, or DE0 is newer if it is 1. As a result, the
selection logic is as stated in the frame on the left hand side
of Fig. 30. The temporary buffer bypass control BPTBy then is
the logical sum of three bypass controls BPTBzy as in the logic
expressed in the frame on the right hand side of Fig. 30.
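
The "latest data wins" priority of the read data multiplexer can be
sketched as an ordered search. This is a simplified model, not the
logic of Fig. 30: the function name read_data and the dictionary of
bypass flags keyed by "E0", "E1", "L3" and "TB" are hypothetical,
with "TB" standing for the logical sum BPTBy.

```python
def read_data(bp, th0, de0, de1, dl3, tby, rdy):
    """Select read data DRy for one read port.

    bp: dict of asserted bypass controls for this port
    th0: thread number TH0 (decides which of DE0 and DE1 is newer)
    de0, de1, dl3: output data of stages E0, E1 and L3
    tby: temporary buffer read data; rdy: register file read data
    """
    if th0 == 0:
        newest_first = [("E1", de1), ("E0", de0)]   # DE1 is newer
    else:
        newest_first = [("E0", de0), ("E1", de1)]   # DE0 is newer
    newest_first += [("L3", dl3), ("TB", tby)]
    for name, value in newest_first:
        if bp.get(name, False):
            return value        # first hit is the latest producer
    return rdy                  # no bypass: use the register file
```

When no bypass control is asserted the register file value is
returned unchanged.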

Now, actual execution of the program of Fig. 16 by this
embodiment of the invention would consist of the following
actions. First at a point of time t0, the instruction address
stage A0 of the instructions #1 and #2 is implemented. The
instruction supply part IF0 places the address of the instruction
#1 over the instruction address IA0, and issues a fetch request
to the memory control part MC. At the same time, it latches
the instruction address IA0 to the program counter PC0. Then,
the instruction address multiplexer MIA selects IA0 as IA, and
supplies it to the memory control part MC.

At the next cycle time t1, the instruction address stage
A0 of the instructions #3 and #4 is implemented. To the program
counter PC0 is added 4, the result being placed over the
instruction address IA0 and supplied to the memory control part
MC via the multiplexer MIA, and a fetch request is issued. At
the same time, the instruction address IA0 is latched to the
program counter PC0. Further, the instruction fetch stage I0
of the instructions #1 and #2 is implemented. The memory control
part MC fetches two instructions, i.e. the instructions #1 and
#2, from the address of the instruction #1, and supplies them
to the instruction supply part IF0 as the fetch instruction IL.
The instruction supply part IF0 stores them into the instruction
queue IQ0n and, at the same time, supplies them to the instruction
multiplexers MX0 and MX1 as the instructions I00 and I01. As
the repeat counter RC0 then is at 0, the count indicating the
non-use of the repeat mechanism, 0 is assigned as the thread
synchronization numbers ID00 and ID01. The instruction
multiplexers MX0 and MX1 respectively select instructions I00
and I01, generate the instruction codes MI0 and MI1 and the
register information MR0 and MR1, and supply them to the
instruction decoders DEC0 and DEC1 and the register scoreboard
RS. Thus, the instructions #1 and #2 are supplied to the pipe
0 and the pipe 1, respectively. Incidentally, though the
instruction #1 is a branching-related instruction, it is supplied
immediately after the instruction fetch, before it can be analyzed
by the branching-related instruction decoder BDEC0; it is
therefore supplied to the instruction decoder DEC0, which turns
the processing into a no-operation (NOP).

At the point of time t2, the instruction address stage
A0 of the instructions #5, #6 and #9 is implemented. First,
4 is added to the program counter PC0 of the instruction supply
part IF0 for updating, and a request to fetch the instructions
#5 and #6 is issued. As the instruction #9 is a repeat start
and end instruction, repeat setup is accomplished with the
instructions #1, #3 and #7. The branching-related instruction
decoder BDEC0 decodes the LDRE instruction of the instruction
#1, adds the offset OFS0 for the instruction #9 to the program
counter PC0 to generate the address of the instruction #9,
and stores it at the end of repeat address RE0. As at the point
of time t1, the instruction fetch stage I0 of the instructions
#3 and #4 is implemented. Further, as the actions of the
instruction decode stages D0 and D1 of the instructions #1 and
#2, the following is performed. As the instruction #1 is a
branching-related instruction, the instruction decoder DEC0
turns the processing into an NOP. The instruction decoder DEC1
decodes the instruction #2 to supply the control information
C1, and further supplies the register information validity VR1.
The instruction #2 is an instruction to store a constant x_addr
at r0. Although an address usually consists of 32 bits, the
addresses of x_addr and y_addr to be explained later are reduced
in size to be expressed in immediate values in the instruction.
Then the immediate value x_addr is placed over the control
information C1 to be supplied to the instruction execution part
EX1. Further, as RA1 is to be used for write control to r0,
V1 out of the register information validity VR1 is asserted.
In the register scoreboard RS, the write information of the
instruction #2 is stored into the scoreboard cell SBE1.

At a point of time t3, as the actions of the instruction
address stage A0 of the instructions #7, #8 and #9, the following
is performed. First, as at the point of time t2, a request to
fetch the instructions #7 and #8 is issued. The
branching-related instruction decoder BDEC0 decodes the LDRS
instruction of the instruction #3, adds the offset OFS0 for the
instruction #9 to the program counter PC0 to generate the address
of the instruction #9, and stores it at the repeat start address
RS0. At the same time, the repeat start address RS0 and the
end of repeat address RE0 are compared by a repeat address
comparator CR0. As both represent the instruction #9, they are
identical, which indicates a 1 instruction repeat, and this
identity information is stored. Also, as at the point of time t1, the
instruction fetch stage I0 of the instructions #5 and #6 is
implemented. Further, as the actions of the instruction decode
stages D0 and D1 of the instructions #3 and #4, the following
is performed. As the instruction #3 is a branching-related
instruction, the instruction decoder DEC0 turns the processing
into an NOP. The instruction decoder DEC1, because the
instruction #4 is an instruction to store a constant y_addr at
r1, places the constant y_addr over the control information C1,
and supplies it to the instruction execution part EX1. Further,
as RA1 is to be used for write control to r1, V1 out of the register
information validity VR1 is asserted. Also, the instruction
execution stage E1 of the instruction #2 is performed. The
instruction execution part EX1 executes the instruction #2 in
accordance with the control information C1. Thus the immediate
value x_addr is supplied to the execution result DE1.

The register scoreboard RS supplies the write information
of the instruction #2 from the scoreboard cell SBE1 and, as the
control part CTL has an individual thread STH and write validity
VE1, asserts the register write signal SE1. As a result, in
the register file RF of the register module RM, the immediate
value x_addr, which is the execution result DE1, is written at
r0 designated by the write register number WE1. Also, the write
information of the instruction #4 is stored into the scoreboard
cell SBE1.

At a point of time t4, as the actions of the instruction
address stages A0 and A1 of the instructions #11 and #12, the
following is carried out. The branching-related instruction
decoder BDEC0 of the instruction supply part IF0 decodes the
THRDG/R instruction of the instruction #5, adds to PC0 the offset
OFS0 for the instruction #11 to generate the top address of the
new thread, i.e. the address of the instruction #11, places it
over the instruction address IA0, and issues an instruction fetch
request to the memory control part MC. Also, as at the point
of time t1, the instruction fetch stage I0 of the instructions
#7 and #8 is performed. Further, as the actions of the instruction
decode stages D0 and D1, the following is carried out.

As the instruction #5 is a branching-related instruction,
the instruction decoder DEC0 turns the processing into an NOP.
The instruction decoder DEC1 decodes the instruction #6, places
the immediate value 0 over the control information C1 as in the
case of the instruction #2, supplies it to the instruction
execution part EX1, and asserts V1 out of the register information
validity VR1. It also implements the instruction execution
stage E1 of the instruction #4 as it did for the instruction
#2 at the point of time t3. The register scoreboard RS and the
register module RM process the instructions #4 and #6 as they
did for the instructions #2 and #4 at the point of time t3.

At a point of time t5, as the actions of the instruction
address stage A0 of the instructions #9 and #10, the following
is performed. First, as at the point of time t2, a request to
fetch the instructions #9 and #10 is issued. The
branching-related instruction decoder BDEC0 of the instruction
supply part IF0 decodes the LDRC instruction of the instruction
#7, places the number of repeats 8 over OFS0, and stores it at
the number of repeats RC0. This completes the repeat setup.
Also the instruction fetch stage I1 of the instructions #11 and
#12 is implemented. The memory control part MC fetches the
instructions #11 and #12, and the instruction supply part IF1
adds 0 to them as the thread synchronization number ID1n, holds
the result in the instruction queue IQ1n, and also supplies them
to the instruction multiplexers MX1 and MX0 as the instructions
I10 and I11. However, as the thread synchronization numbers
of both the data defining thread on the instruction supply part
IF0 side and the data using thread of the instruction supply
part IF1 are 0 and accordingly identical, the instruction
multiplexers MX1 and MX0 select the instruction supply part
IF0 side, which is the data defining thread, in accordance with
the selection logic of Fig. 21. As there is no instruction in
the instruction queue IQ0n then, invalid instructions are
supplied to the instruction decoders DEC0 and DEC1. Further,
as the actions of the instruction decode stages D0 and D1 of the
instructions #7 and #8, the following is performed. Since the
instruction #7 is a branching-related instruction, the
instruction decoder DEC0 turns the processing into an NOP. The
instruction decoder DEC1 decodes the instruction #8, and supplies
NOP control. Furthermore, it implements the instruction
execution stage E1 of the instruction #6 as it did for the
instruction #2 at the point of time t3. The register scoreboard
RS and the register module RM process the instruction #6 as was
the case with the instruction #4 at the point of time t3.

At a point of time t6, the instruction address stage A0
of the instruction #9 is implemented. At the instruction supply
part IF0, the program counter PC0 and the end of repeat address
RE0 become identical to cause the comparator CE0 to give an output
of 1. As the number of repeats RC0 is eight, a comparator CC0
gives an output of 0 and, as the AND output is 1, the multiplexer
MR0 selects the repeat start address RS0, which is supplied as
the instruction fetch address IA0 and stored into the program
counter PC0. The number of repeats RC0 is decremented to seven,
which is selected by the multiplexer MC0 and stored at the number
of repeats RC0. Further, as this is a repeat of 1 instruction,
the instruction queue IQ0n is indicated to hold instructions
from #9 onward. Further, the instruction address stage A1 of
the instructions #13, #14 and #15 is implemented. The program
counter PC1 of the instruction supply part IF1 is updated by
adding 4, and a request to fetch the instructions #13 and #14
is issued. The branching-related instruction decoder BDEC1
decodes the LDRE instruction of the instruction #11, and stores
the address of the instruction #15 at the end of repeat address
RE1 as was the case with the instruction #1. Further, as at
the point of time t1, the instruction fetch stage I0 of the
instructions #9 and #10 is implemented. 0 is added then as the
thread synchronization number ID0. Incidentally, as
the first repeat action is revealed when the end of repeat address
RE0 is reached, the thread synchronization number is not 8 but
0 as before the repeat range is reached. As the indication to
hold instructions is still in effect, the instructions #9 and
#10 are held in the instruction queue IQ0n even after the supply.
In addition, the instructions #11 and #12 are held in the
instruction queue IQ1n and, as there has been time for the
branching-related instruction decoder BDEC1 to analyze the
instructions #11 and #12 and judge that both are
branching-related instructions, and there is no other
instruction, the instruction queue IQ1n has no
instruction to supply to the instruction decoder. Nor is there
any instruction to be processed at the instruction fetch stage
I1.
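
By way of illustration only, the repeat action of the instruction
address stage described for the point of time t6 can be sketched in
Python as follows. This is a minimal sketch under stated
assumptions, not the circuit of the embodiment: the names
RepeatUnit and address_stage are hypothetical, the AND is assumed
to combine the output of the comparator CE0 with the inverted
output of the comparator CC0, and the sequential fall-through
(adding 4) once the count is exhausted is assumed rather than
taken from the text.

```python
from dataclasses import dataclass


@dataclass
class RepeatUnit:
    pc: int             # program counter (PC0)
    repeat_start: int   # repeat start address (RS0)
    repeat_end: int     # end of repeat address (RE0)
    repeat_count: int   # number of repeats (RC0)


def address_stage(unit: RepeatUnit, fetch_bytes: int = 4) -> int:
    """Advance one cycle and return the next instruction fetch address."""
    at_end = unit.pc == unit.repeat_end       # comparator CE0 output
    count_is_zero = unit.repeat_count == 0    # comparator CC0 output
    if at_end and not count_is_zero:          # assumed form of the AND
        unit.pc = unit.repeat_start           # repeat start address selected
        unit.repeat_count -= 1                # decremented count is stored
    else:
        unit.pc += fetch_bytes                # assumed sequential fall-through
    return unit.pc


# 1 instruction repeat: start and end addresses coincide (hypothetical 0x20).
unit = RepeatUnit(pc=0x20, repeat_start=0x20, repeat_end=0x20, repeat_count=8)
address_stage(unit)
print(unit.pc == 0x20, unit.repeat_count)  # True 7
```

With the count initially at eight, the first call reproduces the
behavior stated for t6: the fetch address returns to the repeat
start address and the count becomes seven.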

At a point of time t7, the instruction address stages A0
and A1 of the instructions #9 and #15 are implemented. The
instruction supply part IF0 performs a repeat action as in the
preceding cycle to decrement the number of repeats RC0 to six.
The branching-related instruction decoder BDEC1 of the
instruction supply part IF1 decodes the LDRS instruction of the
instruction #12, stores the address of the instruction #15 at
the repeat start address RS1 as was the case with the instruction
#3, and stores address identity information for 1 instruction
repeat control. Also, the instruction fetch stages I0 and I1
of the instructions #9, #13 and #14 are implemented. The
instruction supply part IF0 adds 7 as the thread synchronization
number ID00 to the instruction #9 held in the instruction queue
IQ0n, and supplies the result to the instruction multiplexer
MX0 as the instruction I00. Incidentally, this action is done
using the pre-decrement value simultaneously with the foregoing
decrement. For this reason, the added value is 7. As this is
a repeat action, the instruction immediately following the
instruction #9 is not the instruction #10. Accordingly there
is no instruction to be supplied as the instruction I01, and
the instruction validity IV01 of the instruction I01 is negated.
The memory control part MC fetches the instructions #13 and #14,
and the instruction supply part IF1 adds to them 0 as the thread
synchronization number ID1n. The result is stored into the
instruction queue IQ1n, and at the same time supplied to the
instruction multiplexers MX1 and MX0 as the instructions I10 and
I11. Though the instruction #9 then supplied as the instruction
I00 entails register reading, as there is no prior data load
instruction, all the write validities VL, VL0 and VL1 of the
scoreboard information CM are negated, and no flow dependency
arises.
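
The assignment of thread synchronization numbers to successive
issues of a repeated instruction may be illustrated by the
following Python sketch. The function name sync_numbers is
hypothetical, and the sketch assumes only what is stated above:
the first issue carries 0 because the repeat action is not yet
revealed, and each later issue carries the value of the number of
repeats before that cycle's decrement.

```python
def sync_numbers(initial_count: int, issues: int) -> list[int]:
    """Tags attached to successive issues of a 1 instruction repeat body."""
    count = initial_count
    tags = [0]              # first issue: repeat action not yet revealed
    count -= 1              # first repeat action decrements the count
    for _ in range(issues - 1):
        tags.append(count)  # tag uses the pre-decrement value
        count -= 1          # decrement performed in the same cycle
    return tags


print(sync_numbers(8, 5))  # [0, 7, 6, 5, 4]
```

For a number of repeats of 8, the sequence 0, 7, 6, 5, 4 agrees
with the values given for the instruction #9 at the points of time
t6 through t10.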

Further, the instruction #13, as it immediately follows
a fetch, is subjected to no executability determination. As
a result, the instruction multiplexers MX1 and MX0 select the
instructions I00 and I10, i.e. the instructions #9 and #13, and
supply them to the instruction decoders DEC0 and DEC1. The
instruction decode stage D0 of the instruction #9 is also
implemented. The instruction decoder DEC0, as the instruction
#9 is an instruction to load data from an address indicated by
the register r0 into the register r2 and increment the register
r0, supplies its control information C0. Further, as RA0 is
used for the read and write control of r0 and RB0 for the write
control of r2, VA0, V0 and LV0 out of the register information
validity VR1 are asserted.

The register scoreboard RS supplies the register read
number RA0 and the bypass control BPxy (x = E0, E1, L0, L1, L2,
L3, TB0, TB1, TB2; y = A0, B0, A1, B1). In the diagram of pipeline
operation shown in Fig. 18, the write and read register numbers
and thread synchronization number of each scoreboard cell are
added under each point of time. The hatched parts represent
the thread 1 (data using thread) information and other parts,
the thread 0 (data defining thread) information. At the point
of time t7, as there is no valid write information, all the bypass
controls BPxy are negated. The write information of the
instruction #9 for r0 and r2 is stored into the scoreboard cells
SBE0 and SBL0. The selection of the scoreboard cell SBL0 input
follows the logic shown in Fig. 24. As the thread number TH0
== 0 and the register information validity LV0 is asserted, the
information of the instruction #9 on the pipe 0 side is selected.

At a point of time t8, the instruction address stages A0
and A1 of the instructions #9, #15 and #16 are implemented. The
instruction supply part IF0 performs a repeat action as in the
preceding cycle to decrement the number of repeats RC0 to 5. The
program counter PC1 of the instruction supply part IF1 is updated
with the addition of 4, and a request to fetch the instructions
#15 and #16 is issued. The branching-related instruction
decoder BDEC1 decodes the LDRC instruction of the instruction
#13, and stores 8 at the number of repeats RC1 as was the case
with the instruction #7. Also, the instruction fetch stages
I0 and I1 of the instructions #9 and #14 are implemented. The
instruction supply part IF0, as it did at the point of time t7,
adds 6 to the instruction #9 as the thread synchronization number
ID00, and supplies the result to the instruction multiplexer
MX0 as the instruction I00. The instruction #9 then entails
reading of the register r0, and there is a possibility of flow
dependency occurrence. However, as the prior data load for which
the write validity VL of the scoreboard information CM is asserted
is for r2, the register numbers do not match and no flow
dependency occurs. Further, the instruction supply
part IF1 supplies the instruction multiplexer MX1 with the
instruction #14, as the instruction I10, held in the instruction
queue IQ1n. As a result, the instruction multiplexers MX0 and
MX1 select the instructions I00 and I10, i.e. the instructions
#9 and #14, and supply them to the instruction decoders DEC0
and DEC1. Also, as at the point of time t7, it implements the
instruction decode stage D0 of the instruction #9 as well as
the decode stage D1 of the instruction #13. As the instruction
#13 is a branching-related instruction, the instruction decoder
DEC1 turns the processing into an NOP. Further, the instruction
execution stage E0 of the instruction #9 is implemented. The
instruction execution part EX0, in accordance with the control
information C0, places the read data DRA0 over the execution
result DM0 as the load address, and supplies it to the memory
control part MC. It also increments the read data DRA0, which
is supplied as the execution result DE0 to the register module
RM.

In the register scoreboard RS, at the point of time t8,
writes into the registers r0 and r2 are stored in the cells SBE0
and SBL0, respectively, with the thread synchronization number
of 0 as shown in Fig. 18. Further, r0 is supplied to the register
read number RA0 with the thread synchronization number of 7.
As the cell SBE0 and the read number RA0 are identical at r0
and, though there is a difference in thread synchronization
number, 0 versus 7, the thread numbers THE0 and TH0 are both
0, BPE0A0 out of the bypass controls is asserted. Further in
the scoreboard cells SBE0 and SBL0, as the thread numbers THE0
and THL0 are both 1, write-backs BNE0 and BNL0 are negated in
accordance with the logic shown in Fig. 25. The next stage write
control information NL0 generated by adding this write-back BNL0
is stored into the scoreboard cell SBL1. Also, in the control
logic CTL, as the individual thread STH is negated and the
write-back BNE0 is negated with the thread number THE0 of 0, the
write indication SE0 is negated and the temporary buffer control CE0
is asserted according to the sixth and seventh equations of Fig.
27. All Sx (x = TB0, TB1, TB2, L3, E0, E1) and Cx are negated
because the write validity Vx is negated. As a result, as shown
in the table of Fig. 27, the data selections M0, M1 and M2 become
E0, TB0 and TB1, respectively. Then, the next stage write control
information units NM0, NM1 and NM2 turn into NE0, NTB0 and NTB1,
respectively, and they are stored into the temporary buffer
control information spaces SBTB0, SBTB1 and SBTB2. Further,
the write information of the instruction #9 is stored into the
cells SBE0 and SBL0 as at the point of time t7. In the register
module RM, in accordance with the data selections M0, M1 and
M2, the execution result DE0 and the temporary buffer data DTB0
and DTB1 are written into the temporary buffers DTB0, DTB1 and
DTB2. Also, as the bypass control BPE0A0 has been asserted,
in the bypass multiplexer MA0, the execution result DE0 is
selected as the read data DRA0 in accordance with the logic shown
in Fig. 30.
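
The update of the temporary buffers described for the point of
time t8, in which the data selections M0, M1 and M2 are E0, TB0
and TB1, amounts to a one-position shift. The following Python
sketch illustrates only that case; the function name and the
string values used as data are hypothetical, and the other rows
of the table of Fig. 27 are not modeled.

```python
def shift_temporary_buffers(de0, buffers):
    """Return the new (DTB0, DTB1, DTB2) for M0, M1, M2 = E0, TB0, TB1."""
    dtb0, dtb1, _dtb2 = buffers   # the old DTB2 content is shifted out
    return (de0, dtb0, dtb1)      # DE0 -> DTB0, DTB0 -> DTB1, DTB1 -> DTB2


buffers = (None, None, None)
for result in ("r0 after pass 1", "r0 after pass 2", "r0 after pass 3"):
    buffers = shift_temporary_buffers(result, buffers)
print(buffers)
# ('r0 after pass 3', 'r0 after pass 2', 'r0 after pass 1')
```

The newest execution result is thus always found in DTB0 and the
oldest retained result in DTB2.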

At a point of time t9, the instruction address stages A0
and A1 of the instructions #9 and #15 are implemented. The
instruction supply part IF0 performs a repeat action as in the
preceding cycle to decrement the number of repeats RC0 to 4. In
the instruction supply part IF1, the program counter PC1 and
the end of repeat address RE1 prove identical in the address
of the instruction #15, and a repeat action is started, as was
the case with the instruction #9, to decrement the number of repeats
RC1 to 7.

Also, the instruction fetch stages I0 and I1 of the
instructions #9, #15 and #16 are implemented. The instruction
supply part IF0, as at the point of time t7, adds 5 to the
instruction #9 as the thread synchronization number ID00, and
supplies the resultant instruction I00 to the instruction
multiplexer MX0. Though the instruction #9 then entails reading
of the register r0, as the prior data load for which the write
validities VL and VL0 are asserted is for r2, the register
numbers do not match and no flow dependency occurs.
The memory control part MC fetches the instructions #15 and #16,
and the instruction supply part IF1 stores them into the
instruction queue IQ1n and, at the same time, supplies them as
the instructions I10 and I11 to the instruction multiplexers
MX1 and MX0. As the instructions I10 and I11 immediately follow
a fetch, the instruction multiplexer MX1 performs no
executability determination. As a result, the instruction
multiplexers MX1 and MX0 select the instructions I00 and I10,
i.e. the instructions #9 and #15, and supply them to the
instruction decoders DEC0 and DEC1. Further, as at the point
of time t7, the instruction decode stage D0 of the instruction
#9 is also implemented. Also, the instruction decoder DEC1
implements the instruction decode stage D1 of the instruction
#14. As the instruction #14 is for NOP, the control information
C1 carries out NOP processing. Further, as at the point of time
t8, the instruction execution stage E0 of the instruction #9
is implemented. Also, the memory control part MC performs the
data load stage L1 of the instruction #9.

The state of the register scoreboard RS at the point of
time t9 is as shown in Fig. 18. As at the point of time t8,
the bypass control BPE0A0 is asserted. Also, the cell SBTB0
and the read number RA0 become identical at r0 and, as the thread
numbers THTB0 and TH0 are both 0, the bypass control BPTB0A0
is asserted. As at the point of time t8, the write-backs BNE0
and BNL0 are negated, the cell SBL1 is updated, the write
indication SE0 is negated, and the temporary buffer control CE0
is asserted. Further, in the cells SBL1 and SBTB0, as the thread
numbers THL1 and THTB0 are 1, the write-backs BNL1 and BNTB0
continue to be negated in accordance with the logic shown in
Fig. 26.

The next stage write control information NL1 generated
by adding this write-back BNL1 is stored into the scoreboard
cell SBL2. Then, the write indication STB0 is negated according
to the sixth and seventh equations of Fig. 27, and the temporary
buffer control CTB0 is asserted. As a result, as shown in the
table of Fig. 27, the data selections M0, M1 and M2 become E0,
TB1 and TB2, respectively, as at the point of time t8, and
consequently the temporary buffer control information units
SBTB0, SBTB1 and SBTB2 are updated. Further, the write
information of the instruction #9 is stored into the cells SBE0
and SBL0 as at the point of time t7. In the register module
RM as well, as at the point of time t8, the temporary buffers
DTB0, DTB1 and DTB2 are updated in accordance with the data
selections M0, M1 and M2. Further, as the bypass controls BPE0A0
and BPTB0A0 have been asserted, the execution result DE0 is
selected as the read data DRA0 in the bypass
multiplexer MA0 in accordance with the logic shown in Fig. 30.
In the temporary buffer TB then, the temporary buffer data DTB0
are read by the bypass control BPTB0A0 as the temporary buffer
read data TBA0, and in the bypass multiplexer MA0, too, BPTBA0
is asserted. However, as the bypass control BPE0A0 is also
asserted, the new execution result DE0 is selected in accordance
with the logic shown in Fig. 30.
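
The precedence among asserted bypass controls may be pictured as
a priority selection. The Python sketch below is illustrative
only: the logic of Fig. 30 is not reproduced here, the function
name select_read_data is hypothetical, and the only ordering taken
from the text is that the execution result DE0 prevails over the
temporary buffer read data TBA0 when both controls are asserted;
the fall-back to the register file value is an assumption.

```python
def select_read_data(register_value, candidates):
    """Pick the read data from bypass sources listed newest first.

    candidates is a list of (asserted, data) pairs; the first asserted
    entry wins, otherwise the register file value is used.
    """
    for asserted, data in candidates:
        if asserted:
            return data
    return register_value


# BPE0A0 and BPTBA0 both asserted: the execution result is selected.
print(select_read_data("RF r0", [(True, "DE0"), (True, "TBA0")]))   # DE0
# Only the temporary buffer bypass asserted.
print(select_read_data("RF r0", [(False, "DE0"), (True, "TBA0")]))  # TBA0
```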

At a point of time t10, the instruction address stages
A0 and A1 of the instructions #9 and #15 are implemented. The
instruction supply part IF0 performs a repeat action as in the
preceding cycle to decrement the number of repeats RC0 to 3. The
instruction supply part IF1, though it performs a repeat action
as in the preceding cycle, keeps the number of repeats RC1
unchanged at 7 because the register scoreboard RS asserts the
stall STL1 to be explained later. Also, the instruction fetch
stages I0 and I1 of the instructions #9, #15 and #17 are implemented.
The instruction supply part IF0, as at the point of time t7,
adds 4 to the instruction #9 as the thread synchronization number
ID00 and supplies it to the instruction multiplexer MX0 as the
instruction I00. Though the instruction #9 then entails reading
of the register r0, as the prior data load for which the write
validities VL, VL0 and VL1 are asserted is for r2, the register
numbers do not match and no flow dependency occurs.
The memory control part MC fetches the instruction
#17 and the next instruction, and the instruction supply part
IF1 stores them into the instruction queue IQ1n and, at the
same time, supplies them as the instructions I10 and I11 to the
instruction multiplexers MX1 and MX0. It also supplies the
instruction #15 to the instruction multiplexer MX1 as the
instruction I10. Although the instruction I10 then, i.e. the
instruction #15, entails reading of the registers r2 and r3,
as the prior data loads for which the write validities VL, VL0
and VL1 are asserted have the thread synchronization numbers 7,
6 and 5, no flow dependency occurs. As this is a repeat
action, the instruction immediately following the instruction
#15 is not the instruction #16. Accordingly there is no
instruction to be supplied as the instruction I11, and the
instruction validity IV11 of the instruction I11 is negated.
As a result, the instruction multiplexers MX1 and MX0 select
the instructions I00 and I10, i.e. the instructions #9 and #15,
and supply them to the instruction decoders DEC0 and DEC1.
Further, as at the point of time t7, the instruction decoder
DEC0 implements the instruction decode stage D0 of the
instruction #9 and the instruction decode stage D1 of the
instruction #15. As the instruction #15 is an instruction to
add the registers r2 and r3 and to store the sum at r3, its control
information C1 is supplied. Further, as RA0 is used for the
read and write control of r3 and RB0 for the read control of
r2, VA0, VB0 and V0 out of the register information validity
VR1 are asserted. Also, as at the point of time t8, the
instruction execution stage E0 of the instruction #9 is
implemented. Further, the memory control part MC performs the
data load stages L1, L2 and L3 of the instruction #9.

The state of the register scoreboard RS at the point of
time t10 is as shown in Fig. 18. As at the point of time t9,
the bypass controls BPE0A0 and BPTB0A0 are asserted. Also, as
the cell SBTB1 and the read number RA0 become identical at r0 and
the thread numbers THTB1 and TH0 are both 0, the bypass control
BPTB1A0 is asserted. Further, as the cell SBL2 and the read
number RB1 of the instruction #15 become identical at r2 and
the thread synchronization numbers IDL2 and ID1 are both 0, the
bypass control BPL2B1 is asserted. Then, the stall STL1 is
asserted in the scoreboard control part CTL, the instruction
#15 is deterred from execution, and the write validity to be
written into the scoreboard cell SBE1 is negated. Also, as at
the point of time t9, the write-backs BNE0, BNL0, BNL1 and BNTB0
are negated, the cells SBL1 and SBL2 are updated, the write
indications SE0 and STB0 are negated, and the temporary buffer
controls CE0 and CTB0 are asserted. Further, in the cells SBL2
and SBTB1, as the thread number TH1 is 1 and the thread
synchronization numbers IDL2 and IDTB1 are identical with ID1,
all being 0, the write-backs BNL2 and BNTB1 are asserted in
accordance with the logic shown in Fig. 26. The next stage write
control information NL2 generated by adding this write-back BNL2
is stored into the scoreboard cell SBL3. Then, the write
indication STB1 is asserted according to the sixth and seventh
equations of Fig. 27, and the temporary buffer control CTB1 is
negated. As a result, as shown in the table of Fig. 27, the
data selections M0, M1 and M2 become E0, TB1 and TB2, respectively,
as at the point of time t8, and consequently the temporary buffer
control information units SBTB0, SBTB1 and SBTB2 are updated.
Further, the write information of the instruction #9 is stored
into the cells SBE0 and SBL0 as at the point of time t7. In
the register module RM as well, as at the point of time t8, the
temporary buffers DTB0, DTB1 and DTB2 are updated in accordance
with the data selections M0, M1 and M2. Then the temporary buffer
data DTB1 are written back into the register r0 of the register
file RF by the write indication STB1. Further, as the bypass
controls BPE0A0, BPTB0A0 and BPTB1A0 have been asserted, the
execution result DE0 is selected as the read data DRA0 in the
bypass multiplexer MA0 in accordance with the logic shown in
Fig. 30. In the temporary buffer TB then, the temporary buffer
data DTB0 are read by the bypass controls BPTB0A0 and BPTB1A0
as the temporary buffer read data TBA0, and in the bypass
multiplexer MA0, too, BPTBA0 is asserted. However, as the bypass
control BPE0A0 is also asserted, the latest execution result
DE0 is selected in accordance with the logic shown in Fig. 30.
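
The conditions under which a bypass control is asserted in the
foregoing description may be summarized by the following Python
sketch. It is an inference from the cases described, not a
reproduction of the scoreboard logic: the names WriteCell and
bypass_asserted are hypothetical, and the rule assumed is that a
register number match suffices within one thread, whereas a read
by the other thread additionally requires identical thread
synchronization numbers.

```python
from dataclasses import dataclass


@dataclass
class WriteCell:
    valid: bool   # write validity of the scoreboard cell
    reg: int      # write register number
    thread: int   # thread number of the writing instruction
    sync: int     # thread synchronization number of the write


def bypass_asserted(cell: WriteCell, read_reg: int,
                    read_thread: int, read_sync: int) -> bool:
    if not cell.valid or cell.reg != read_reg:
        return False                   # no write, or register mismatch
    if cell.thread == read_thread:
        return True                    # same thread: sync number ignored
    return cell.sync == read_sync      # other thread: sync numbers must match


# Cell holding r0 (thread 0, sync 0) read as r0 by thread 0 with sync 7.
print(bypass_asserted(WriteCell(True, 0, 0, 0), 0, 0, 7))  # True
# Cell holding r2 (thread 0, sync 0) read as r2 by thread 1 with sync 0.
print(bypass_asserted(WriteCell(True, 2, 0, 0), 2, 1, 0))  # True
# Load of r2 with sync 7 does not concern thread 1 at sync 0.
print(bypass_asserted(WriteCell(True, 2, 0, 7), 2, 1, 0))  # False
```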

At a point of time t11, the instruction address stages
A0 and A1 of the instructions #9 and #15 are implemented. The
supply part IF0 performs a repeat action as in the preceding
cycle to decrement the number of repeats RC0 to 2. The supply
part IF1 again performs a repeat action as at the point of time
t9 to decrement the number of repeats RC1 to 6. Also, the
instruction fetch stages I0 and I1 of the instructions #9 and
#15 are implemented. The instruction supply part IF0, as at
the point of time t7, adds 3 to the instruction #9 as the thread
synchronization number ID00 and supplies it to the instruction
multiplexer MX0 as the instruction I00. As at the point of
time t10, no flow dependency then occurs to the instruction #9.
The instruction supply part IF1 adds 7 to the instruction #15
as the thread synchronization number ID10 and supplies it to
the instruction multiplexer MX1 as the instruction I10. As at
the point of time t10, no flow dependency occurs to the instruction
#15. As a result, the instruction multiplexers MX1 and MX0 select
the instructions I00 and I10, i.e. the instructions #9 and #15,
and supply them to the instruction decoders DEC0 and DEC1.
Further, as at the point of time t7, the instruction decoder
DEC0 implements the instruction decode stage D0 of the
instruction #9. It also implements the instruction decode stage
D1 of the instruction #15. As the instruction #15 was prevented
in the preceding cycle by the stall STL1 from execution, the
instruction decoder DEC1 does not update the input instruction, and
instead supplies again the decoded result of the instruction
#15. Also, as at the point of time t8, the instruction execution
stage E0 of the instruction #9 is implemented. Further, the
memory control part MC implements the data load stages L1, L2
and L3 of the instruction #9.
The state of the register scoreboard RS at the point of
time t11 is as shown in Fig. 18. Incidentally, as the instruction
#15 was prevented from execution in the preceding cycle by the
assertion of the stall STL1, the register information MR1 is
not updated. As at the point of time t9, the bypass controls
BPE0A0, BPTB0A0 and BPTB0A1 are asserted. Also, the cell SBTB2
and the read number RA0 become identical at r0 and, as the thread
numbers THTB2 and TH0 are both 0, the bypass control BPTB2A0
is asserted. Further, the cell SBL3 and the read number RB1
become identical at r2 and, as the thread synchronization numbers
IDL3 and ID1 are both 0, the bypass control BPL3B1 is asserted.
Also, as at the point of time t9, the write-backs BNE0, BNL0,
BNL1 and BNTB0 are negated, the cells SBE0, SBL0, SBL1 and SBL2
are updated, the write indications SE0 and STB0 are negated,
and the temporary buffer controls CE0 and CTB0 are asserted.
Further, as the thread numbers THL2 and THTB1 are 1 in the cells
SBL2 and SBTB1, the write-backs BNL2 and BNTB1 continue to be
negated in accordance with the logic shown in Fig. 26. Also,
as the thread synchronization numbers IDL3 and IDTB2 are identical
with ID0, all being 0, in the cells SBL3 and SBTB2, the write-backs
BNL3 and BNTB2 are asserted in accordance with the logic shown
in Fig. 26. Then the write indications SL3 and STB1 are asserted
according to the sixth and seventh equations of Fig. 27, and
the temporary buffer controls CL3 and CTB2 are negated. As a
result, as shown in the table of Fig. 27, the data selections
M0, M1 and M2 become E0, TB1 and TB2, respectively, as at the
point of time t8, and consequently the temporary buffer control
information units SBTB0, SBTB1 and SBTB2 are updated. In the
register module RM as well, as at the point of time t8, the
temporary buffers DTB0, DTB1 and DTB2 are updated in accordance
with the data selections M0, M1 and M2. Then the load data DL3
and the temporary buffer data DTB2 are written back into the
registers r2 and r0 of the register file RF by the write indications
SL3 and STB2. Further, as the bypass controls BPE0A0, BPTB0A0
and BPTB1A0 have been asserted, the execution result DE0 is
selected as the read data DRA0 in the bypass multiplexer MA0
in accordance with the logic shown in Fig. 30. In the temporary
buffer TB then, the temporary buffer read data DTB0 are read
by the bypass controls BPTB0A0, BPTB1A0 and BPTB2A0 as the
temporary buffer read data TBA0, and in the bypass multiplexer
MA0, too, BPTBA0 is asserted. However, as the bypass control
BPE0A0 is also asserted, the latest execution result DE0 is
selected in accordance with the logic shown in Fig. 30. Also,
as the bypass control BPL3B1 has been asserted, in the bypass
multiplexer MB1, the load data DL3 are selected as the read data
DRB1 in accordance with the logic shown in Fig. 30. The read
data DRA1 are read out of the register r3 of the register file
RF.
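
The bypass selection described above can be illustrated by the
following minimal sketch. It is not the logic of Fig. 30, which is
not reproduced here. It only assumes that a cell produces a bypass
when its register number and its thread tag match those of the read
request, and that the newest matching cell takes priority, with the
register file as the fallback. The names Cell and select_read_data
are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Cell:
    name: str            # source of the data, e.g. "E0", "TB0" or "L3"
    reg: Optional[int]   # register number held in the cell, None if invalid
    tag: int             # thread number (or synchronization number) of the writer
    data: int            # value that would be bypassed


def select_read_data(read_reg, read_tag, cells_newest_first, regfile):
    """Return (source, value) for one read port.

    Assumption: the first matching cell in newest-first order wins,
    mirroring the statement that the latest execution result is chosen
    when several bypass controls are asserted at once.
    """
    for cell in cells_newest_first:
        if cell.reg == read_reg and cell.tag == read_tag:
            return cell.name, cell.data
    # No bypass asserted: read from the register file.
    return "RF", regfile[read_reg]
```

With cells for E0 and TB0 both holding r0 for the same thread, the
sketch returns the E0 value, which corresponds to DE0 being chosen
over the temporary buffer data in the description above.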

At a point of time t12, as at the point of time t11, the
instruction address stages A0 and A1 and the instruction fetch
stages I0 and I1 of the instructions #9 and #15 are implemented.
Further, as at the point of time t10, the instruction decode
stages D0 and D1 of the instructions #9 and #15, the instruction
execution stage E0 of the instruction #9 and the data load stages
L1, L2 and L3 of the instruction #9 are implemented. Then, the
execution stage E1 of the instruction #15 is implemented. In
the instruction execution part EX1, the read data DRA1 and DRB1
are added, and the sum is supplied to the execution result DE1.

The state of the register scoreboard RS at the point of
time t12 is as shown in Fig. 18. It is substantially the same
as at the point of time t11, except that the thread
synchronization number is less by 1 and the cell SBE1 now holds
the write information for the register r3. Then, the cell
SBE1 and the read number RB0 become identical at r3 and, as the
thread numbers THE1 and TH1 are both 0, the bypass control BPE1A1
is asserted. As at the point of time t11, each cell in the
scoreboard is updated. In the register module RM, too, as at
the point of time t11, the temporary buffer TB and the registers
r2 and r0 of the register file RF are updated, and the read data
DRA0 and DRB1 are selected. Also, as the bypass control BPE1A1
has been asserted, in the bypass multiplexer MA1, the execution
result DE1 is selected as the read data DRA1 in accordance with
the logic shown in Fig. 30.

At a point of time t13, the instruction address stages
A0 and A1 of the instructions #9 and #15 are implemented. The
instruction supply part IF0 performs a repeat action as in the
preceding cycle but, as the number of repeats RC0 is 1,
the output of the number-of-repeats comparator CC0 is 1 and the
AND gate is 0, with the result that the instruction address
multiplexer MR0 indicates the address + 4 of the instruction
#9, i.e. the instruction next to the instruction #10, and releases
the instructions of the instruction buffer from #9 onward from
their held state. The number of repeats RC0 is decremented to
0. Incidentally, the description of the instruction next to
#10 and the following instructions will be dispensed with at
and after the point of time t14. The instruction supply part
IF1, as at the point of time t9, performs a repeat action to increase
the number of repeats RC0 to 4. As at the point of time t12,
the instruction fetch stages I0 and I1, the instruction decode
stages D0 and D1 and the instruction execution stages E0 and
E1 of the instructions #9 and #15, together with the data load
stages L1, L2 and L3 of instruction #9, are implemented.
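
The end-of-repeat behavior described above can be sketched as
follows. This is an illustration under stated assumptions, not the
circuit of the embodiment: the comparator is assumed to output 1
when the number of repeats equals 1, and the AND gate is assumed to
combine the repeat mode with the inverted comparator output. The
function name repeat_address_step and its parameters are
hypothetical.

```python
def repeat_address_step(held_addr, rc, repeat_mode=True):
    """Model one instruction-address cycle of an instruction supply part.

    held_addr   -- address of the instruction being repeated
    rc          -- number of repeats remaining (RC in the text)
    Returns (next_addr, new_rc, released).
    """
    cc = (rc == 1)                      # comparator: this is the last pass
    and_gate = repeat_mode and not cc   # 1 while the repeat must continue
    # While repeating, the held address is reissued; otherwise the
    # address multiplexer selects the held address + 4.
    next_addr = held_addr if and_gate else held_addr + 4
    released = not and_gate             # buffer entries leave the held state
    new_rc = rc - 1 if repeat_mode else rc
    return next_addr, new_rc, released
```

With rc equal to 1 the sketch selects the held address + 4, releases
the held instructions and decrements the count to 0, which matches
the sequence narrated for the point of time t13.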

The state of the register scoreboard RS at the point of
time t13 is as shown in Fig. 18. It is the same as at the point
of time t12 except that the thread synchronization number is
less by 1. Then, as at the point of time t12, each cell in the
scoreboard is updated, and the temporary buffer TB and the
register file RF in the register module RM are updated, with
the read data DRA0, DRA1 and DRB1 being selected.

At a point of time t14, as at the point of time t13, the
instruction address stage A1 and the instruction fetch stage
I1 of the instruction #15, the instruction decode stages D0 and
D1 and the instruction execution stages E0 and E1 of the
instruction #9 and the instruction #15 and the data load stages
L1, L2 and L3 of the instruction #9 are implemented. Further,
as the process has been released from the repeat mode, the
instruction #10 is decoded by the branching-related instruction
decoder BDEC0 to perform SYNCE instruction processing. The SYNCE
instruction is an instruction to wait for the completion of a
data using thread. As the thread synchronization number ID1 of
the data using thread, i.e. the thread 1, returns to 0 at the
end of the repeat, that thread will stall if the thread
synchronization number ID0 remains at 0, on account of the rule
that the data using thread should not pass the data defining
thread. Therefore, the instruction multiplexers MX0 and MX1 are
so controlled as to override this rule from the time of decoding
the SYNCE instruction until the end of the data using thread.
As this control takes effect from the instruction #16, it is
shown as the instruction address stage A1 of the instruction
#16 in Fig. 18.
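
The issue rule and its override by the SYNCE instruction can be
sketched as below. The sketch assumes, from the description above
only, that the data using thread is held back when its thread
synchronization number equals that of the data defining thread,
and that this check is bypassed while the data defining thread is
waiting on SYNCE. The actual comparison performed by the
instruction multiplexers is not reproduced here, and the name
data_user_may_issue is hypothetical.

```python
def data_user_may_issue(id_user, id_definer, synce_pending):
    """Decide whether the data using thread may issue an instruction.

    id_user       -- thread synchronization number of the data using thread
    id_definer    -- thread synchronization number of the data defining thread
    synce_pending -- True from the decoding of SYNCE until the data using
                     thread ends (the override period)
    """
    if synce_pending:
        # The defining thread is waiting, so the user cannot pass it.
        return True
    # Assumed rule: equal numbers mean the user has caught up and must wait.
    return id_user != id_definer
```

Under these assumptions, equal synchronization numbers of 0 block
the data using thread before SYNCE is decoded and allow it to
proceed afterwards.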

The state of the register scoreboard RS at the point of
time t14 is as shown in Fig. 18. It is the same as at the point
of time t13 except that the thread synchronization number is
less by 1. Then, as at the point of time t13, each cell in the
scoreboard is updated, and the temporary buffer TB and the
register file RF in the register module RM are updated, with
the read data DRA0, DRB1 and DRA1 being selected.

At a point of time t15, as at the point of time t14, the
instruction address stage A1, the instruction fetch stage I1
and the instruction decode stage D1 of the instruction #15, the
instruction execution stages E0 and E1 of the instruction #9
and the instruction #15 and the data load stages L1, L2 and L3
of the instruction #9 are implemented.

The state of the register scoreboard RS at the point of
time t15 is as shown in Fig. 18. It is the same as at the point
of time t14 except that the thread synchronization number is
less by 1 and r0 is not read at RA0. Then, as at the point of
time t14, each cell in the scoreboard is updated, though no new
write information is held in the scoreboard cells SBE0 and SBL0
and these cells are invalidated. Also, the temporary buffer
TB and the register file RF in the register module RM are updated,
and the read data DRA1 and DRB1 are selected.



At a point of time t16, as at the point of time t15, the
instruction address stage A1, the instruction fetch stage I1,
the instruction decode stage D1 and the instruction execution
stage E1 of the instruction #15 and the data load stages L1,
L2 and L3 of the instruction #9 are implemented. At the
instruction address stage A1, though the instruction supply part
IF1 performs a repeat action as in the preceding cycle, as the
number of repeats RC0 is 1, the output of the number-of-repeats
comparator CC0 is 1 and the AND gate is 0, with the result that
the instruction address multiplexer MR1 indicates the address
+ 4 of the instruction #15, i.e. the instruction #17, and releases
the instructions of the instruction buffer from #15 onward from
their held state. The number of repeats RC0 is decremented to
0.

The state of the register scoreboard RS at the point of
time t16 is as shown in Fig. 18. It is the same as at the point
of time t15 except that the thread synchronization number is
less by 1 and the cells SBE0 and SBL0 are invalidated. Then,
as at the point of time t15, each cell in the scoreboard is updated,
though no new write information is held in the scoreboard cells
SBL1 and SBTB0 and these cells are invalidated. Also, the
temporary buffer TB and the register file RF in the register
module RM are updated, and the read data DRA1 and DRB1 are selected,
though no writing into the register r2 is done.

At a point of time t17, as at the point of time t16, the
instruction fetch stage I1, the instruction decode stage D1 and
the instruction execution stage E1 of the instruction #15 and
the data load stages L2 and L3 of the instruction #9 are
implemented.

The state of the register scoreboard RS at the point of
time t17 is as shown in Fig. 18. It is the same as at the point
of time t16 except that the thread synchronization number is
less by 1 and the cells SBL1 and SBTB0 are invalidated. Then,
as at the point of time t16, each cell in the scoreboard is updated,
though no new write information is held in the scoreboard cells
SBL2 and SBTB1 and these cells are invalidated. Also, the
temporary buffer TB and the register file RF in the register
module RM are updated, and the read data DRA1 and DRB1 are selected.

At a point of time t18, the instruction fetch stage I1
of the instruction #16 is implemented. The instruction supply
part IF1 supplies the instruction #16 of the instruction queue
IQ1n to the instruction decoder DEC1 via the instruction
multiplexer MX1 as the instruction 110. Although the thread
synchronization number then is 0, the same as that of the data
defining thread, the data defining thread side is waiting for the
completion of the data using thread in accordance with the SYNCE
instruction, and an instruction of the same thread
synchronization number can now be issued. Also, as at the point
of time t17, the instruction decode stage D1 and the instruction
execution stage E1 of the instruction #15 and the data load stage
L3 of the instruction #9 are implemented.

The state of the register scoreboard RS at the point of
time t18 is as shown in Fig. 18. It is the same as at the point
of time t17 except that the thread synchronization number is
less by 1 and the cells SBL2 and SBTB1 are invalidated. Then,
as at the point of time t17, each cell in the scoreboard is updated,
though no new write information is held in the scoreboard cells
SBL3 and SBTB2 and these cells are invalidated. Also, the
temporary buffer TB and the register file RF in the register
module RM are updated, and the read data DRA1 and DRB1 are selected.

At a point of time t19, the instruction decode stage D1
of the instruction #16 is implemented. The instruction #16 is
an instruction to store the contents of the register r3 at an
address indicated by the register r1. The instruction decoder
DEC1 supplies the control information CI for this purpose. Also,
out of the register validities VR1, VA1 and VB1 are asserted.
As at the point of time t17, the instruction execution stage
E1 of the instruction #15 is implemented. Also, the
branching-related instruction decoder BDEC1 of the instruction
supply part IF1 decodes THRDE of the instruction #17, stops the
instruction supply part IF1, and asserts the end of thread ETH1.

The state of the register scoreboard RS at the point of
time t19 is as shown in Fig. 18. It is the same as at the point
of time t18 except that the thread synchronization number is
less by 1, the cells SBL3 and SBTB2 are invalidated, and the
register read numbers RA1 and RB1 are different. Then, as at
the point of time t18, each cell in the scoreboard is updated,
though no new write information is held in the scoreboard cell
SBE1 and this cell is invalidated. Also, the register file RF
in the register module is updated, though only the register
r3 is updated. Further, the read data DRA1 are read out of r1
in the register file RF, and the cell SBE1 and the register number
of the read number RB1 become identical at r3, and the thread
numbers THE1 and TH1 become identical with the result that the
bypass control BPE1B1 is asserted, and the execution result DE1
is selected in the read data multiplexer MB1 as DRB1.

At a point of time t20, the instruction execution stage
E1 of the instruction #16 is implemented. The read data DRA1
are supplied to the execution result DE1 as a store address in
accordance with the control information CI, and the read data
DRB1 are supplied to the execution result DM1 as data. Also,
as the end of thread ETH has been asserted, the scoreboard control
CTL asserts the individual thread STH in accordance with the
fifth equation shown in Fig. 27.

As described so far, the multi-thread system of this 
embodiment of the invention can conceal the data load time. 

In this embodiment of the invention, the data defined by
the data defining thread and written into the temporary buffer
TB of the register module RM are not used by the data using thread.
The data used by the data using thread are load data, which are
used immediately after their loading and directly written into
the register file RF. Where the temporary buffers are wastefully
used in this way, if the data load time is extended, even more
buffers will be needed for wasteful writing. If the data load
time is 30 units, executing the program of Fig. 16 without a
stall by a temporary buffer-full STLTB would require 29 temporary
buffers. Since data in temporary buffers have to be read out
under bypass control as required and supplied to the instruction
execution part, an increase in the number of temporary buffers
would mean an increased hardware volume and a drop in execution
speed. A way to avoid such problems is to restrict the registers
that can be defined by the data defining thread and used by the
data using thread.
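
The buffer count quoted above can be reproduced by the following
small sketch. It assumes that the number of temporary buffers
needed to avoid the stall STLTB grows linearly with the data load
time, as the pairing of 30 units with 29 buffers suggests. That
linear relation is an inference, not a formula stated in this
description, and the function name is hypothetical.

```python
def temporary_buffers_needed(load_time_units):
    """Estimate the temporary buffers needed to run without an STLTB stall.

    Assumption: one buffer per load-time unit, less one, which
    reproduces the 30-unit / 29-buffer figure given in the text.
    """
    if load_time_units < 1:
        raise ValueError("data load time must be at least 1 unit")
    return load_time_units - 1
```

The same relation appears consistent with the three temporary
buffers TB0 to TB2 used alongside the load stages of Fig. 18.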

For instance, a specific register or group of registers
can be assigned as the link register(s) by a link register
assigning instruction, and only the assigned link
register(s) can be used for data transfers between threads. Then,
if the program of Fig. 16 is used, r2 is assigned as the link
register. In this way, registers other than r2 will need no
consideration of reverse dependency and output dependency
between threads, and therefore execution results can be directly
written into the register file RF. Then, the use of temporary
buffers in the pipeline operation of Fig. 18 will be totally
eliminated.
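
The effect of the link register assignment can be sketched as a
simple classification of destination registers. The sketch assumes
only what is stated above: inter-thread dependency needs to be
considered for the assigned link registers alone, and results for
every other register can go straight to the register file. The
function name and the set-based representation are hypothetical.

```python
def needs_inter_thread_check(dest_reg, link_registers):
    """Return True when a write must be checked against the other thread.

    dest_reg       -- destination register number of the instruction
    link_registers -- set of register numbers assigned as link registers
    """
    return dest_reg in link_registers


# Example for the program of Fig. 16, where r2 is the link register.
LINK_REGISTERS = {2}
direct_writes = [r for r in range(4)
                 if not needs_inter_thread_check(r, LINK_REGISTERS)]
```

Here direct_writes lists the registers whose results would bypass
the temporary buffers entirely under this assumption.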

In this case, where the data load time is 30 units, for
the execution of the program of Fig. 16 without a stall, 30 load
stages will be sufficient with the addition of L4 through L29.
In this connection, SBL4 through SBL29 are added to the register
scoreboard. Then, bypass controls from SBL0 through SBL28 will
all be reflected only in the stalls STL0 and STL1, and there
will be no increase in the number of data bypasses.

In a conventional processor, the data load time can take
a plurality of values: one for a case in which an on-chip
cache is hit, one in which the data are in an on-chip memory,
one in which an off-chip cache is hit, one in which the data are
in an off-chip memory and so forth. For instance, where the data
load time can be 2, 4, 10 or 30 units, by providing bypasses
matching SBL1, SBL3, SBL9 and SBL29 and differentially using
a stall or a bypass according to the length of the data load
time, the present invention can be adapted to a plurality of
data load time lengths.
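
The pairing of load times with bypass points given above can be
written out as a lookup. The sketch assumes that a load time with
no matching bypass point is handled by a stall instead. The text
does not specify that fallback in detail, and the names
BYPASS_TAPS, bypass_cell_for and tap_name are hypothetical.

```python
# Load time in units -> scoreboard cell providing the bypass,
# taken from the example values in the text.
BYPASS_TAPS = {2: "SBL1", 4: "SBL3", 10: "SBL9", 30: "SBL29"}


def bypass_cell_for(load_time):
    """Return the bypass cell for a load time, or None to indicate a stall."""
    return BYPASS_TAPS.get(load_time)


def tap_name(load_time):
    """Name the cell one stage before the end of a load of this length."""
    return f"SBL{load_time - 1}"
```

In all four example values the bypass cell index is the load time
minus one, which tap_name makes explicit.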

In addition, though not defined for this embodiment of the
invention, there are arithmetic instructions taking a long time
to execute, such as division instructions. It is readily
possible for persons skilled in the art to realize hardware
for such instructions similar to that for data loading.

Although the threads 0 and 1 are fixed as a data defining
thread and a data using thread, respectively, according to this
embodiment, eliminating this fixation is readily possible for
persons skilled in the art as stated above. It is also
conceivable to configure a program in which, after the completion
of processing of the data defining thread, this thread is ended
by a THRDE instruction, the data using thread is used as a new
data defining thread, a new thread is actuated by a THRDG
instruction, and the actuated thread is assigned as the new data
using thread. In this way, the SYNCE instruction used in this
embodiment can be dispensed with, the period during which only
one thread is available can be shortened, and the performance
can be correspondingly enhanced.

In addition, this embodiment supposes a one-way flow of data,
but the link register assignment described above would make
two-way data communication possible as well. A different link
register is assigned to each direction, a data definition
synchronizing instruction SYNCD is issued upon completion of
the execution of the data defining instruction for the link
register by each thread, and a data use synchronizing instruction
SYNCU is issued upon completion of the use of the link register.
Then, the thread synchronization number is updated at the time
of issuing the SYNCU instruction. Instead of the SYNCU
instruction, repeating can be used for synchronization as in
this embodiment. Two-way exchanging of data among a plurality of
threads would be effective in simultaneous processing with loose
coupling, in which data dependency is scarce but does exist. Fig.
31 illustrates a flow of program processing in an inter-thread
two-way data communication system.

First, r2 is assigned for the direction from the thread
TH0 to the thread TH1 and r3 for the other direction as the link
registers by a link register assigning instruction RNCR. Then,
link register defining instructions #01 and #11 are executed
in the threads TH0 and TH1, respectively. After that, a data
definition synchronizing instruction SYNCD is issued to execute
link register use instructions #0t and #ly, respectively.
Finally, a data use synchronizing instruction SYNCU is issued.
The execution time may vary from one thread to another. A case
in which the execution of the thread TH1 is quicker than the
thread TH0 is shown in TH1.a of Fig. 31. In this case, as the
link register use instruction #ly of the thread TH1 waits for
the issue of the thread TH0 data definition synchronizing
instruction SYNCD, there will be no wrong detection of flow
dependency. The contrary case, in which the execution of the
thread TH1 is slower, is shown in TH1.b of Fig. 31. In this case,
as the link register use instruction #lt of the thread TH0 waits
for the issue of the thread TH1 data definition synchronizing
instruction SYNCD, there will again be no wrong detection of
flow dependency. The data definition synchronizing instruction
SYNCD changes the order of execution priority between the
threads. It has to be noted, however, that the execution priority
in this example differs from one link register to another. For
r2, the thread TH0 is given priority over TH1, and for r3, the
thread TH1 is given priority over TH0.
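
The two-way exchange described above can be imitated in software
as in the following sketch. It is only an analogy using Python
threads and events, not the hardware mechanism of the invention.
It assumes that SYNCD can be modeled as setting a flag for the
link register just defined, and that a link register use
instruction waits for the other thread's flag. All names in the
code are hypothetical.

```python
import threading


def run_two_way_exchange():
    """Exchange one value in each direction through r2 and r3."""
    link = {"r2": None, "r3": None}   # r2: TH0 -> TH1, r3: TH1 -> TH0
    defined = {"r2": threading.Event(), "r3": threading.Event()}
    received = {}

    def thread_body(name, out_reg, in_reg, value):
        link[out_reg] = value          # link register defining instruction
        defined[out_reg].set()         # SYNCD: the definition is complete
        defined[in_reg].wait()         # use waits for the other thread's SYNCD
        received[name] = link[in_reg]  # link register use instruction
        # A SYNCU would follow here to mark the end of the use.

    t0 = threading.Thread(target=thread_body, args=("TH0", "r2", "r3", 10))
    t1 = threading.Thread(target=thread_body, args=("TH1", "r3", "r2", 20))
    t0.start()
    t1.start()
    t0.join()
    t1.join()
    return received
```

Whichever thread runs faster, each use waits for the matching
definition, so the values received do not depend on the relative
speed of the two threads.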

While inter-thread data communication is carried out via
registers in this embodiment of the invention, it is readily
possible for persons skilled in the art to accomplish
inter-thread data communication via memories by managing
memories by the use of the whole or part of memory addresses
instead of register numbers.

The present invention makes it possible to achieve
performance standards comparable to large-scale out-of-order
execution or software pipelining with simple and small hardware
by adding only a simple control mechanism to a conventional
multi-thread processor. Furthermore, a level of performance
which a conventional multi-thread processor cannot achieve with
simultaneous or time multiplex execution of many threads can
be attained with only two or so threads according to the invention.
The overhead burden of thread generation and completion can be
reduced correspondingly to the reduction in the number of threads,
and the hardware for storing the states of many threads can also
be saved.



